CN116155541A - Automatic machine learning platform and method for network security application - Google Patents


Info

Publication number
CN116155541A
CN116155541A (application CN202211635591.0A)
Authority
CN
China
Prior art keywords
sample
model
module
data
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211635591.0A
Other languages
Chinese (zh)
Inventor
陈刚
邓巧华
梁群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongtong Uniform Chuangfa Science And Technology Co ltd
Original Assignee
Zhongtong Uniform Chuangfa Science And Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongtong Uniform Chuangfa Science And Technology Co ltd
Priority to CN202211635591.0A
Publication of CN116155541A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

Embodiments of the present disclosure provide an automated machine learning platform and method for network security applications. The platform comprises a data preprocessing module, a sample and label generation module, a feature extraction module and a model training module. The data preprocessing module is used for preprocessing traffic data to obtain a behavior sequence corresponding to the user ID; the sample and label generation module is used for generating samples and their corresponding labels according to the behavior sequence corresponding to the user ID; the feature extraction module is used for performing feature extraction on the samples to obtain sample features; the model training module is used for constructing an initial model by adopting a model structure corresponding to the data format of the sample features, and training the initial model with the sample features and their labels to obtain a target model. In this way, user involvement can be reduced, machine learning automation is achieved, and the barrier to use is further lowered.

Description

Automatic machine learning platform and method for network security application
Technical Field
The disclosure relates to the technical field of network security, in particular to an automatic machine learning platform and method for network security application.
Background
With the rapid development of artificial intelligence in recent years, machine learning, as its main implementation method, has also developed rapidly, and the major internet companies have launched machine learning platforms of their own.
Through a machine learning platform, a reasonable model can be obtained from data. Such models can solve scientific research problems, be applied in many fields of everyday life, and actively guide activities such as production and daily living.
However, current machine learning platforms usually require repeated user involvement to carry out model training and cannot achieve automation, so the barrier to use is generally high for users.
Disclosure of Invention
The present disclosure provides an automated machine learning platform and method for network security applications.
In a first aspect, embodiments of the present disclosure provide an automated machine learning platform for network security applications, the platform comprising a data preprocessing module, a sample and label generation module, a feature extraction module and a model training module;
the data preprocessing module is used for preprocessing traffic data to obtain a behavior sequence corresponding to the user ID;
the sample and label generating module is used for generating a sample and a label corresponding to the sample according to the behavior sequence corresponding to the user ID;
the feature extraction module is used for extracting features of the sample to obtain sample features;
the model training module is used for constructing an initial model by adopting a model structure corresponding to the data format of the sample characteristics, and training the initial model by utilizing the sample characteristics and the corresponding labels thereof to obtain a target model.
In some implementations of the first aspect, the preprocessing includes: user identification, entity identification, URL path normalization, behavior sequence extraction.
In some implementations of the first aspect, the sample and tag generation module is specifically configured to:
and generating a positive sample, a negative sample and labels respectively corresponding to the positive sample and the negative sample according to the behavior sequence corresponding to the user ID.
In some implementations of the first aspect, the feature extraction module is specifically configured to:
and adopting a feature extraction strategy corresponding to the data type of the sample to extract the features of the sample, thereby obtaining the features of the sample.
In some implementations of the first aspect, the model training module is specifically configured to:
randomly dividing sample characteristics and labels corresponding to the sample characteristics into a training set and a testing set;
constructing initial models with different hyperparameters by adopting a model structure corresponding to the data format of the sample features;
training each initial model by utilizing a training set to obtain a plurality of trained candidate models;
and testing the trained multiple candidate models by using the test set, and selecting the candidate model with the best test effect as the target model.
In some implementations of the first aspect, the model training module is specifically configured to:
up-sampling the minority-class sample features among the sample features to generate additional minority-class sample features and their corresponding labels;
randomly dividing the original sample features and their labels, together with the generated minority-class sample features and their labels, into a training set and a test set.
In some implementations of the first aspect, the platform further includes: a model deployment module;
and the model deployment module is used for deploying the target model on the cloud server so as to call the target model to predict the characteristics to be predicted through the API.
In some implementations of the first aspect, the platform further includes: a model updating module;
the model updating module is used for collecting prediction results, determining the mispredicted features according to the prediction results, constructing a data set from the mispredicted features and their corresponding true results, and fine-tuning the target model on the data set.
In a second aspect, embodiments of the present disclosure provide an automated machine learning method for network security applications, the method being applied to an automated machine learning platform for network security applications as described above, and comprising:
preprocessing traffic data to obtain a behavior sequence corresponding to a user ID;
generating a sample and a corresponding label according to a behavior sequence corresponding to the user ID;
performing feature extraction on the sample to obtain sample features;
and constructing an initial model by adopting a model structure corresponding to the data format of the sample characteristics, and training the initial model by utilizing the sample characteristics and the corresponding labels thereof to obtain a target model.
In a third aspect, embodiments of the present disclosure provide an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described above.
In the embodiments of the present disclosure, the automated machine learning platform can automatically complete the entire pipeline, from development to deployment, of machine learning stages such as data preprocessing, sample and label generation, feature extraction and model training in different business scenarios, so that user involvement can be reduced, machine learning automation is achieved, and the barrier to use is further lowered.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. For a better understanding of the present disclosure, and without limiting the disclosure thereto, the same or similar reference numerals denote the same or similar elements, wherein:
FIG. 1 illustrates a block diagram of an automated machine learning platform for network security applications provided by embodiments of the present disclosure;
FIG. 2 illustrates a block diagram of yet another automated machine learning platform for network security applications provided by embodiments of the present disclosure;
FIG. 3 illustrates a flow chart of an automated machine learning method for network security applications provided by embodiments of the present disclosure;
fig. 4 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the disclosure, are within the scope of the disclosure.
In addition, the term "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
In view of the problems in the background art, embodiments of the present disclosure provide an automated machine learning platform and method for network security applications. Specifically, the platform comprises a data preprocessing module, a sample and label generation module, a feature extraction module and a model training module. The data preprocessing module is used for preprocessing traffic data to obtain a behavior sequence corresponding to the user ID; the sample and label generation module is used for generating samples and their corresponding labels according to the behavior sequence corresponding to the user ID; the feature extraction module is used for performing feature extraction on the samples to obtain sample features; the model training module is used for constructing an initial model by adopting a model structure corresponding to the data format of the sample features, and training the initial model with the sample features and their labels to obtain a target model.
Therefore, based on this automated machine learning platform, the entire pipeline, from development to deployment, of machine learning stages such as data preprocessing, sample and label generation, feature extraction and model training can be completed automatically in different business scenarios; user involvement can be reduced, machine learning automation is achieved, and the barrier to use is further lowered.
The following describes in detail, by way of specific embodiments, an automated machine learning platform and a method for network security oriented applications provided by embodiments of the present disclosure with reference to the accompanying drawings.
Fig. 1 illustrates a block diagram of an automated machine learning platform for network security applications provided by an embodiment of the present disclosure. As illustrated in Fig. 1, the automated machine learning platform may include a data preprocessing module, a sample and label generation module, a feature extraction module and a model training module.
The data preprocessing module is used for preprocessing traffic data to obtain a behavior sequence corresponding to the user ID.
The preprocessing may include, without limitation, user identification, entity identification, URL path normalization, behavior sequence extraction, and the like.
The sample and label generation module is used for generating samples and their corresponding labels according to the behavior sequence corresponding to the user ID.
For example, the sample and label generation module may be configured to generate a positive sample, a negative sample, and the label corresponding to each according to the behavior sequence corresponding to the user ID.
In this way, samples and labels for supervised learning may be automatically generated, thereby converting scene problems lacking labeled samples into supervised learning problems while providing more dimensional information for subsequent data analysis.
The feature extraction module is used for performing feature extraction on the samples to obtain sample features.
For example, the feature extraction module may be configured to perform feature extraction on the sample by using a feature extraction strategy corresponding to the data type of the sample, thereby obtaining the sample features.
The model training module is used for constructing an initial model by adopting a model structure corresponding to the data format of the sample characteristics, and training the initial model by utilizing the sample characteristics and the corresponding labels thereof to obtain a target model.
For example, the model training module may be used to randomly divide the sample features and their corresponding labels into a training set and a test set. The model training module may also up-sample the minority-class sample features first, generating additional minority-class sample features and their labels, and then randomly divide the original and generated sample features with their labels into a training set and a test set; this effectively balances the classes, so that subsequent model training works better.
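As an illustration, the balancing-and-splitting step described above might be sketched as follows; `oversample_and_split` and its argument names are hypothetical, and simple random duplication stands in for whatever up-sampling scheme the platform actually uses:

```python
import random

def oversample_and_split(features, labels, test_ratio=0.2, seed=0):
    """Randomly duplicate minority-class samples until classes are
    balanced, then shuffle and split into train/test sets."""
    rng = random.Random(seed)
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    target = max(len(v) for v in by_class.values())
    data = []
    for y, fs in by_class.items():
        extra = [rng.choice(fs) for _ in range(target - len(fs))]
        data += [(f, y) for f in fs + extra]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]
```

With 8 majority and 2 minority samples, the minority class is duplicated up to 8, giving 16 balanced samples before the split.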
In addition, initial models with different hyperparameters are built using a model structure corresponding to the data format of the sample features, i.e., each initial model has its own unique hyperparameter combination.
Each initial model is trained with the training set to obtain multiple trained candidate models; the candidate models are then tested with the test set, and the one with the best test performance is selected as the target model.
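The candidate-selection loop can be sketched as follows; the threshold "model" here is a deliberately trivial stand-in for the platform's real model structures, and all names are hypothetical:

```python
def train_and_select(train, test, hyperparams):
    """For each hyperparameter value, fit a trivial threshold model
    (a stand-in for a real model structure), then keep the candidate
    with the best test-set accuracy."""
    def fit(_data, thresh):
        # "model" = classify feature[0] >= thresh as class 1
        return lambda f: 1 if f[0] >= thresh else 0

    def accuracy(model, data):
        return sum(model(f) == y for f, y in data) / len(data)

    candidates = [(hp, fit(train, hp)) for hp in hyperparams]
    best_hp, best_model = max(candidates,
                              key=lambda c: accuracy(c[1], test))
    return best_hp, best_model
```

The same skeleton applies unchanged when each candidate is, say, a neural network with a different learning rate: train each, score each on the held-out set, keep the winner.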
In the embodiments of the present disclosure, the automated machine learning platform can automatically complete the entire pipeline, from development to deployment, of machine learning stages such as data preprocessing, sample and label generation, feature extraction and model training in different business scenarios, so that user involvement can be reduced, machine learning automation is achieved, and the barrier to use is further lowered.
Fig. 2 illustrates a block diagram of yet another automated machine learning platform for network security applications provided by an embodiment of the present disclosure. As illustrated in Fig. 2, the automated machine learning platform may further include a model deployment module, which is used for deploying the target model on a cloud server so that the target model can be called through an API to predict the features to be predicted.
As shown in Fig. 2, the automated machine learning platform may further include a model updating module, which is used for collecting prediction results, determining the mispredicted features from them, constructing a data set from the mispredicted features and their corresponding true results, and fine-tuning the target model on the data set to achieve model optimization and updating.
Fig. 3 illustrates a flowchart of an automated machine learning method for network security applications according to an embodiment of the present disclosure. As illustrated in Fig. 3, the automated machine learning method 300 may be applied to the automated machine learning platform illustrated in Fig. 1 and Fig. 2, and includes the following steps:
s310, preprocessing the streaming data to obtain a behavior sequence corresponding to the user ID.
The preprocessing may include, without limitation, user identification, entity identification, URL path normalization, behavior sequence extraction, and the like.
S320, generating a sample and a label corresponding to the sample according to the behavior sequence corresponding to the user ID.
In some embodiments, positive samples, negative samples, and their corresponding labels may be generated from the behavior sequence corresponding to the user ID.
S330, extracting the characteristics of the sample to obtain the characteristics of the sample.
In some embodiments, feature extraction policies corresponding to the data type of the sample may be used to extract features from the sample to obtain sample features.
S340, constructing an initial model by adopting a model structure corresponding to the data format of the sample characteristics, and training the initial model by utilizing the sample characteristics and the corresponding labels thereof to obtain a target model.
In some embodiments, sample features and their corresponding labels may be randomly divided into a training set and a test set. For example, the minority-class sample features are first up-sampled to generate additional minority-class sample features and their labels, and then the original sample features and labels, together with the generated minority-class features and labels, are randomly divided into a training set and a test set.
Initial models with different hyperparameters are constructed using the model structure corresponding to the data format of the sample features; each initial model is trained with the training set to obtain multiple trained candidate models, which are then tested with the test set, and the candidate model with the best test performance is selected as the target model.
In the embodiments of the present disclosure, the technical field of network security involves many different data sets and scenarios. Through a unified automated machine learning platform, model construction can be achieved automatically for different data and scenarios, so that analysts can efficiently complete the entire pipeline, from development to deployment, of machine learning stages such as data preprocessing, sample and label generation, feature extraction and model training while focusing on the business itself; user involvement is reduced, machine learning is automated, and the barrier to use is further lowered.
In some embodiments, the target model may be deployed on a cloud server so that it can be called through an API to predict the features to be predicted, which is convenient for users.
Furthermore, in order to optimize and update the model, prediction results can be collected, the mispredicted features determined from them, a data set constructed from the mispredicted features and their corresponding true results, and the target model fine-tuned on that data set.
The following describes the automated machine learning method provided by the embodiments of the present disclosure in detail with reference to a specific embodiment, as follows:
Data preprocessing:
Data preprocessing is mainly aimed at the most widely used data in the field of network security, namely traffic data. Preprocessing of traffic data mainly includes user identification, entity identification, URL path normalization and behavior sequence extraction; through preprocessing, the various entities in the data are depicted more accurately, and the input data required by the subsequent anomaly detection model are extracted.
Based on the traffic data log, a user list is dynamically generated: a specific field combination (src_ip+UA) is designated as the user UID, and a user ID that identifies the behavior subject is extracted from the data content. This ID serves as the analysis subject in subsequent feature analysis and model construction.
Entity identification mainly refers to the various entities in the traffic data that can serve as analysis objects, including data in related fields such as URL, device, IP and UA (user agent); the information in these fields is extracted as the basis for depicting subsequent user behavior.
URL entity normalization: in behavior depiction based on URLs, because different parameters appear in a URL, the same behavior can differ in its URL representation, so URLs need a certain degree of normalization. Unifying the same behavior expressed with different parameters makes subsequent feature analysis and model construction more accurate. By analyzing the parameter items in the URL, the different values of those parameter items are normalized, and unimportant URL content is filtered out, such as image data (jpg, png) and icon-related tag images.
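A minimal sketch of this kind of URL normalization; the exact regexes are hypothetical, since the patent does not fix them:

```python
import re

def normalize_url(url):
    """Hypothetical rules following the description: drop image/icon
    requests, unify query-parameter values, and map digits to 0."""
    if re.search(r'\.(jpg|jpeg|png|gif|ico)(\?|$)', url, re.I):
        return None                        # filtered out entirely
    url = re.sub(r'=([^&#]*)', '=*', url)  # unify parameter values
    url = re.sub(r'\d+', '0', url)         # numbers -> 0
    return url
```

After normalization, `/item?id=123` and `/item?id=456` map to the same string, so the same behavior expressed with different parameters is treated uniformly.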
Behavior sequence extraction: for each extracted user ID, a session behavior sequence of the user is constructed according to the timestamps of the user's log data. For the same user, the logs are arranged in ascending timestamp order to form a sequence set; to make subsequent analysis more accurate, the behavior sequence length is truncated according to time interval and time span. If the interval between adjacent logs of the same user exceeds 30 minutes, the sequence is cut, since the user's operations lack continuity; if the time span of the same user's logs exceeds one day, the sequence is cut at the day boundary, preventing the extracted behavior sequence from becoming too long.
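The truncation rules above can be sketched as follows; timestamps are assumed to be in seconds, and the day-boundary rule is simplified here to a cut whenever a session's total span would exceed 24 hours:

```python
GAP_LIMIT = 30 * 60        # cut when adjacent logs are >30 min apart
SPAN_LIMIT = 24 * 3600     # cut when a session would span >1 day

def split_sessions(events):
    """events: list of (timestamp_seconds, event_name) for one user.
    Returns behavior sequences truncated by gap and total span."""
    events = sorted(events)
    sessions, current = [], []
    for ts, ev in events:
        if current and (ts - current[-1][0] > GAP_LIMIT
                        or ts - current[0][0] > SPAN_LIMIT):
            sessions.append([e for _, e in current])
            current = []
        current.append((ts, ev))
    if current:
        sessions.append([e for _, e in current])
    return sessions
```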
Sample and label generation:
The labels mainly include sample labels and feature labels. Automatic label generation is the process of generating labels for preprocessed data or specific data sets based on certain algorithms and rules. Label generation mainly provides data samples for supervised learning; at the same time, the obtained labels enrich the dimensional information available for subsequent feature analysis and supervised model construction.
Rule-based sample label generation mainly includes generating negative samples of user behavior sequences, and generating negative labels, via an IP blacklist, for the log data associated with IP entities during preprocessing.
Negative sample generation for user behavior sequences: negative samples for the behavior sequence model are constructed automatically from the normal access behaviors in the traffic data. The events accessed by a user are arranged in time order to build the user's normal event access sequence, and sequences with too few access behaviors, i.e., data with too little user activity, are filtered out. For normal users' access behaviors, the starting events of the behavior sequences and a bigram table of adjacent events, i.e., the follow-up events that may appear after each event, are computed.
For the normal behavior sequence of each user, a corresponding number of negative samples is generated according to a configured negative-sample construction ratio; for example, with a 1:1 ratio, one negative sample is generated for each normal behavior sequence. For each positive sample, several events in the sequence are randomly selected and their follow-up events are modified to events that do not occur in the bigram table. The construction ratio is adjusted according to the number of events obtained during event extraction: if there are many events in total, a larger ratio can be set; if there are few, the ratio can be reduced accordingly.
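A sketch of the bigram-based negative-sample construction; the function names are hypothetical, and a single corrupted transition per negative sample is assumed:

```python
import random

def build_bigrams(sequences):
    """Map each event to the set of events observed following it."""
    table = {}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            table.setdefault(a, set()).add(b)
    return table

def make_negative(seq, table, vocab, rng):
    """Corrupt one transition: replace the successor of a random
    event with an event that never follows it in the bigram table."""
    neg = list(seq)
    i = rng.randrange(len(seq) - 1)
    impossible = [e for e in vocab if e not in table.get(seq[i], set())]
    if impossible:
        neg[i + 1] = rng.choice(impossible)
    return neg
```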
For the data after log preprocessing, labels are generated for the corresponding logs via an IP blacklist based on the IP entities in the data: logs containing a blacklisted IP are given a corresponding label, marking that the log may carry a certain risk.
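A minimal illustration of the blacklist labeling, with a hypothetical log layout:

```python
def label_by_blacklist(logs, ip_blacklist):
    """Attach a risk flag to each log whose IP is blacklisted."""
    return [dict(log, risk=log["ip"] in ip_blacklist) for log in logs]
```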
Algorithm-based sample label generation mainly uses unsupervised learning algorithms such as iForest (isolation forest) anomaly detection and DBSCAN: anomalies detected in the sample data are given corresponding anomaly labels, and the clusters identified by clustering assign corresponding group labels to similar data. iForest is an anomaly detection algorithm that partitions the sample data by building tree structures. If a data point can be isolated quickly during partitioning, and this holds when the partitioning results of many trees are aggregated, the point's degree of anomaly is high, meaning it differs substantially from the other data; it can then be preliminarily judged as anomalous and given an anomaly label.
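The isolation idea can be illustrated with a deliberately minimal single-feature version; a production iForest subsamples the data per tree and converts path lengths into a normalized anomaly score, which is omitted here:

```python
import random

def fit_tree(values, depth=0, max_depth=8):
    """Grow one isolation tree over 1-D values with random splits."""
    if depth >= max_depth or len(values) <= 1 or min(values) == max(values):
        return {"leaf": True}
    split = random.uniform(min(values), max(values))
    return {"leaf": False, "split": split,
            "lo": fit_tree([v for v in values if v < split], depth + 1, max_depth),
            "hi": fit_tree([v for v in values if v >= split], depth + 1, max_depth)}

def path_length(tree, x, depth=0):
    if tree["leaf"]:
        return depth
    nxt = tree["lo"] if x < tree["split"] else tree["hi"]
    return path_length(nxt, x, depth + 1)

def avg_path(trees, x):
    return sum(path_length(t, x) for t in trees) / len(trees)
```

Outliers sit alone at the edge of the value range, so random splits isolate them after only a few cuts; their average path length across trees is short, which is exactly the "isolated quickly" signal described above.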
The DBSCAN algorithm is a clustering method that discovers clusters based on the density of the data. The number of clusters is derived entirely from the density structure of the data itself, reducing the influence of a manually set cluster count on the result. Clusters are given group labels according to their judged type, and data that belong to no cluster, i.e., the anomalous points, are given anomaly labels.
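A compact, pure-Python DBSCAN over one-dimensional points, for illustration only (a real deployment would use an indexed implementation such as scikit-learn's):

```python
def dbscan(points, eps, min_pts):
    """Label 1-D points with cluster ids 0, 1, ... or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, p in enumerate(points) if abs(p - points[i]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # provisional noise
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reclaimed as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels
```

Points labeled `-1` are exactly the "belong to no cluster" cases that receive anomaly labels in the text above.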
The feature labels obtained from the data features via the above algorithms support further analysis of the data, providing more feature references and more effective features for constructing subsequent supervised learning models. Meanwhile, for certain specific security scenarios, the anomaly detection algorithm can directly examine specific data and identify specific anomalous behaviors, and the clustering algorithm can directly perform group classification, identifying gang accounts that exhibit clustering characteristics in the security scenario.
Feature extraction:
feature extraction mainly comprises statistical calculation based on specific fields, statistical processing of numerical data, and vocabulary vectorization of text data.
Statistical calculation based on specific fields mainly consists of computing statistics for different entities at different time granularities after the user has preprocessed the data. The time granularity can be set according to actual needs, with customizable settings per second, per minute, per hour, per day, and so on. Statistics over different entities include frequency, the number of distinct values, and the like. For numerical data, statistics such as the sum, mean, and standard deviation can be computed over the corresponding dimensions. Meanwhile, correlation analysis can be applied across dimensions, using correlation coefficients to measure the degree of correlation between them and thereby improve the quality of the features fed into the subsequent model.
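A minimal sketch of the per-entity, per-time-granularity statistics, assuming pandas is available; the field names (`user_id`, `ts`, `bytes`) are hypothetical stand-ins for the preprocessed traffic fields.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u1", "u2"],
    "ts": pd.to_datetime([
        "2023-01-01 00:00:10", "2023-01-01 00:00:40",
        "2023-01-01 00:00:50", "2023-01-01 00:01:20",
        "2023-01-01 00:02:05",
    ]),
    "bytes": [100, 300, 50, 200, 80],
})

# frequency, sum and mean per user per minute; the freq string is the
# configurable time granularity ("s", "min", "h", "D", ...)
stats = (events
         .groupby(["user_id", pd.Grouper(key="ts", freq="min")])
         .agg(requests=("bytes", "count"),
              total_bytes=("bytes", "sum"),
              mean_bytes=("bytes", "mean")))
```

Each row of `stats` is one (entity, time bucket) pair, which is exactly the shape of feature the later model training consumes.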
For text-type data in attack scenes, such as the payload data of web attacks and the script file data of webshells, text segmentation is adopted to convert the text content into vocabulary sequences.
Because the payload of a web attack differs from normal text, its content must be segmented with specific segmentation rules that preserve the integrity of the script content and filter out useless words. First, URL decoding converts the non-standard letters and characters expressed as %00-%ff in the payload back into their original characters. Second, similar vocabularies that differ only by domain name or formatting are normalized into the same form: URLs are replaced with http://u and numbers appearing in the payload are normalized to 0. Finally, the text content is segmented with a regular expression matched to the vocabulary composition of web attacks; this regular expression can be configured by the user, enabling more customized processing of the text content.
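The three-step pipeline above can be sketched with the standard library alone; the token regular expression is an illustrative default standing in for the user-configurable expression mentioned in the text.

```python
import re
from urllib.parse import unquote

def normalize_payload(payload: str) -> list[str]:
    text = unquote(payload)                           # %xx escapes -> characters
    text = re.sub(r"https?://\S+", "http://u", text)  # unify differing URLs
    text = re.sub(r"\d+", "0", text)                  # normalize numbers to 0
    # keep word-like tokens and common attack punctuation as separate tokens;
    # this pattern is a placeholder for the configurable expression
    return re.findall(r"[A-Za-z_0]+|[<>'\"=()/;-]", text)

# a classic SQL-injection fragment, URL-encoded
tokens = normalize_payload("id=1%27%20OR%20%271%27=%271")
```

After decoding and normalization the payload `id=1' OR '1'='1` becomes a stable token sequence regardless of which literal numbers the attacker used.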
For webshell scripts, invalid content in the file, such as large amounts of comment text and invalid HTML code that may be stored in the script file, is filtered out by identifying the valid code in the text. The valid text content is then segmented to obtain the vocabulary sequence of the data; the segmentation regular expression can likewise be customized for faster processing.
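A hedged sketch of stripping invalid content from a webshell-like script before segmentation; the comment and HTML patterns are illustrative, not the platform's actual rules.

```python
import re

def segment_script(source: str) -> list[str]:
    code = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                 # line comments
    code = re.sub(r"<[^>]+>", " ", code)                  # stray html tags
    # identifier-like tokens form the vocabulary sequence
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code)

words = segment_script("<html>/* hidden */ eval($_POST['cmd']); // run\n</html>")
```

Only the code-bearing tokens survive, which is what the downstream word2vec step should see.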
After the text data is segmented, the word2vec algorithm is adopted to vectorize the vocabulary. The platform also provides a visual display of the vectorized vocabulary: the PCA algorithm reduces the word vectors to a low dimension for visualization, and the word2vec parameters can be adjusted through a configuration table in the platform so that the vectors better express the word information.
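The visualization step can be sketched with a hand-rolled PCA over toy vectors. In the platform the inputs would be word2vec embeddings; the random 50-dimensional vectors below are stand-ins, and the PCA is done directly with numpy SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
word_vectors = rng.normal(size=(20, 50))   # 20 "words", 50-dim embeddings

# PCA: centre the data, take the top 2 right singular vectors, project
centered = word_vectors - word_vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T            # 2-D coordinates for plotting
```

`coords_2d` is what the platform would scatter-plot, with the first axis carrying the most variance by construction.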
Model training:
after the features are extracted, a machine learning model is built with a supervised learning method. The automatic machine learning platform provides a variety of model structures, automatically trains and selects among them according to the sample features, and obtains the optimal model through automatic hyperparameter tuning. For unsupervised learning models, the corresponding model training can be performed with the anomaly detection and clustering algorithms used in automatic label generation.
In supervised learning, the label data collected in the security field is highly imbalanced: attack sample data is costly to obtain, and little of it can actually be collected. The minority classes with fewer sample features therefore need to be up-sampled during model training, i.e. expanded in different ways. The simplest way is random sampling with replacement, copying the scarce data in each training batch so that the number of samples per class within the batch is consistent. Alternatively, new minority-class sample features can be generated by slightly perturbing feature values between two similar data points of the same class, so that the different classes reach a certain balance during training.
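The two up-sampling modes described above can be sketched as follows; the class sizes are illustrative, chosen so the minority class matches the majority after expansion.

```python
import numpy as np

rng = np.random.default_rng(7)
majority = rng.normal(0, 1, size=(100, 4))  # plentiful benign samples
minority = rng.normal(3, 1, size=(10, 4))   # scarce attack samples

# mode 1: duplicate minority rows by random sampling with replacement
dup_idx = rng.integers(0, len(minority), size=90)
duplicated = minority[dup_idx]

# mode 2 (SMOTE-like): synthesize points on the segment between two
# minority samples, a small perturbation within the same class
i, j = rng.integers(0, len(minority), size=(2, 90))
t = rng.random((90, 1))
synthetic = minority[i] + t * (minority[j] - minority[i])

balanced_minority = np.vstack([minority, synthetic])  # now 100 vs 100
```

Either `duplicated` or `synthetic` can be appended to the minority class so each batch sees balanced classes.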
Based on the labelled sample features generated in the platform and the labelled sample features customized by the user, the platform supports exploration of multiple model structures, including ensemble learning algorithms such as the mainstream Random Forest and XGBoost, as well as deep learning algorithms with different neural network structures such as DNN, CNN, and RNN. For the data formats of different sample features, the corresponding model structures are trained automatically; multiple models are trained in the process, and the optimal model structure is finally selected according to the metrics on the test set.
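Structure selection by test-set metric can be sketched with two scikit-learn candidates; the data is synthetic and the candidate set is a small stand-in for the platform's fuller one (XGBoost, DNN, CNN, RNN).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# train every candidate structure, score each on the held-out test set
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in candidates.items()}
best_name = max(scores, key=scores.get)   # the structure the platform keeps
```

The same loop generalizes to any number of structures: fit each, score on the test set, keep the winner.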
For any of these models, tuning the hyperparameters is another important factor affecting model quality. Because a model has many adjustable hyperparameters, each with a wide adjustment range, manual tuning is very time-consuming and its result depends mainly on experience. As machine learning frameworks have matured, many automatic tuning algorithms for machine learning models have emerged to relieve this burden; the mainstream tuning algorithms include grid search, random search, TPE, PSO, SMAC, Bayesian optimization, and the like.
Bayesian optimization is a model-based method for finding the minimum of a function and has been applied to hyperparameter search in machine learning problems. TPE belongs to the family of Bayesian optimization: it is a sequential model-based optimization (SMBO) method suited to many scenes, and when computing resources are limited and only a small number of trials can be run, it is more efficient than random search. The method builds a model from the trial history to approximate the performance of hyperparameter settings, then selects new hyperparameters to test based on that model.
The automatic machine learning platform adopts the TPE algorithm for hyperparameter tuning by default. When a model project is established, the user only needs to set the ranges of the hyperparameters of the chosen model; the platform then automatically trains models with different hyperparameter combinations, each combination constituting one trial. Finally, the effects of the different trial models on the test set are compared, the model with the best effect is taken as the result of the project, and the model file is output.
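The trial loop can be sketched as follows. To keep the sketch self-contained it uses random search over a user-set range rather than TPE; a real deployment would substitute a TPE implementation (e.g. from the hyperopt library) for the `random.randint` proposal step. Ranges and the model are illustrative.

```python
import random

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# user-supplied ranges, one entry per tunable hyperparameter
search_space = {"n_estimators": (10, 200), "max_depth": (2, 12)}

random.seed(0)
best_score, best_params = -1.0, None
for _ in range(10):                        # each combination is one trial
    params = {k: random.randint(lo, hi) for k, (lo, hi) in search_space.items()}
    score = (RandomForestClassifier(random_state=0, **params)
             .fit(X_tr, y_tr).score(X_te, y_te))
    if score > best_score:
        best_score, best_params = score, params
```

The model trained with `best_params` is what the platform would serialize as the project's output file.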
Model deployment:
after model training finishes, the model file can be uploaded directly to a cloud server, the API is accessed over HTTP, and remote service calls are made according to the data format the API specifies. By providing the configuration the model requires in the call interface for the given network security scene, functions such as model training, prediction, and evaluation are realized.
Model updating:
after the model is deployed, its effect is continuously evaluated through the platform. The model's predictions on the features, including missed detections and false alarms occurring during prediction, are analysed; for the mispredicted features, a new model can be constructed, trained, and deployed again. Alternatively, online learning can be performed on the original model: starting from the original model's parameters, it is fine-tuned by training on the new data set, realizing model optimization.
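The online-update path can be sketched with scikit-learn's `partial_fit` interface; `SGDClassifier` is a stand-in for whichever model the platform actually deployed, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
X_old, y_old = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)

# initial training before deployment; classes must be declared up front
model = SGDClassifier(random_state=0)
model.partial_fit(X_old, y_old, classes=np.array([0, 1]))

# later: fine-tune on freshly collected (and relabelled) data, keeping the
# deployed model's parameters instead of retraining from scratch
X_new, y_new = rng.normal(size=(50, 5)), rng.integers(0, 2, 50)
model.partial_fit(X_new, y_new)
predictions = model.predict(X_new)
```

Each `partial_fit` call continues gradient updates from the current weights, which is the fine-tuning behaviour the paragraph describes.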
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
It can be appreciated that the automated machine learning method 300 shown in fig. 3 is applied to the automated machine learning platform shown in fig. 1-2, and can achieve the corresponding technical effects, and for brevity, will not be described in detail herein.
Fig. 4 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure. Electronic device 400 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic device 400 may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the electronic device 400 may include a computing unit 401 that may perform various suitable actions and processes according to a computer program stored in a read-only memory (ROM) 402 or a computer program loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 may also store various programs and data required for the operation of the electronic device 400. The computing unit 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Various components in electronic device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, etc.; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408, such as a magnetic disk, optical disk, etc.; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the various methods and processes described above, such as the method 300. For example, in some embodiments, the method 300 may be implemented as a computer program product, including a computer program, tangibly embodied on a computer-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the method 300 described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the method 300 by any other suitable means (e.g., by means of firmware).
The various embodiments described above herein may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a computer-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a computer-readable storage medium would include one or more wire-based electrical connections, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that, the present disclosure further provides a non-transitory computer readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to perform the method 300 and achieve corresponding technical effects achieved by performing the method according to the embodiments of the present disclosure, which are not described herein for brevity.
In addition, the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the method 300.
To provide for interaction with a user, the embodiments described above may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
The above-described embodiments may be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. An automated machine learning platform for network security applications, the platform comprising: the system comprises a data preprocessing module, a sample and label generating module, a characteristic extracting module and a model training module;
the data preprocessing module is used for preprocessing the flow data to obtain a behavior sequence corresponding to the user ID;
the sample and label generating module is used for generating a sample and a label corresponding to the sample according to a behavior sequence corresponding to the user ID;
the characteristic extraction module is used for extracting characteristics of the sample to obtain sample characteristics;
the model training module is used for constructing an initial model by adopting a model structure corresponding to the data format of the sample characteristics, and training the initial model by utilizing the sample characteristics and the corresponding labels thereof to obtain a target model.
2. The platform of claim 1, wherein the preprocessing comprises: user identification, entity identification, URL path normalization, behavior sequence extraction.
3. The platform of claim 1, wherein the sample and tag generation module is specifically configured to:
and generating a positive sample, a negative sample and labels respectively corresponding to the positive sample and the negative sample according to the behavior sequence corresponding to the user ID.
4. The platform of claim 1, wherein the feature extraction module is specifically configured to:
and extracting the characteristics of the sample by adopting a characteristic extraction strategy corresponding to the data type of the sample to obtain the characteristics of the sample.
5. The platform of claim 1, wherein the model training module is specifically configured to:
randomly dividing the sample characteristics and the labels corresponding to the sample characteristics into a training set and a testing set;
constructing initial models with different super parameters by adopting model structures corresponding to the data formats of the sample characteristics;
training each initial model by utilizing a training set to obtain a plurality of trained candidate models;
and testing the trained multiple candidate models by using the test set, and selecting the candidate model with the best test effect as the target model.
6. The platform of claim 5, wherein the model training module is specifically configured to:
up-sampling a minority sample feature in the sample features to generate minority sample features and corresponding labels;
and randomly dividing the sample characteristics and the labels corresponding to the sample characteristics, and the generated minority sample characteristics and the labels corresponding to the sample characteristics into a training set and a testing set.
7. The platform of claim 1, further comprising: a model deployment module;
the model deployment module is used for deploying the target model on a cloud server so as to predict the characteristics to be predicted by calling the target model through an API.
8. The platform of claim 7, further comprising: a model updating module;
the model updating module is used for collecting the prediction result, determining the characteristic of the prediction error according to the prediction result, constructing a data set according to the characteristic of the prediction error and the corresponding real result, and carrying out fine adjustment on the target model according to the data set.
9. A method of automated machine learning for network security applications, wherein the method is applied to an automated machine learning platform for network security applications according to any of claims 1-8, comprising:
preprocessing flow data to obtain a behavior sequence corresponding to a user ID;
generating a sample and a corresponding label according to a behavior sequence corresponding to the user ID;
extracting the characteristics of the sample to obtain sample characteristics;
and constructing an initial model by adopting a model structure corresponding to the data format of the sample characteristics, and training the initial model by utilizing the sample characteristics and the corresponding labels thereof to obtain a target model.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of claim 9.
CN202211635591.0A 2022-12-19 2022-12-19 Automatic machine learning platform and method for network security application Pending CN116155541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211635591.0A CN116155541A (en) 2022-12-19 2022-12-19 Automatic machine learning platform and method for network security application


Publications (1)

Publication Number Publication Date
CN116155541A true CN116155541A (en) 2023-05-23

Family

ID=86339992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211635591.0A Pending CN116155541A (en) 2022-12-19 2022-12-19 Automatic machine learning platform and method for network security application

Country Status (1)

Country Link
CN (1) CN116155541A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116527411A (en) * 2023-07-05 2023-08-01 安羚科技(杭州)有限公司 Data security intelligent protection model construction method and device and collaboration platform
CN116527411B (en) * 2023-07-05 2023-09-22 安羚科技(杭州)有限公司 Data security intelligent protection model construction method and device and collaboration platform

Similar Documents

Publication Publication Date Title
CN110909165B (en) Data processing method, device, medium and electronic equipment
KR20220113881A (en) Method and apparatus for generating pre-trained model, electronic device and storage medium
CN112148772A (en) Alarm root cause identification method, device, equipment and storage medium
CN109240895A (en) A kind of processing method and processing device for analyzing log failure
CN111738331A (en) User classification method and device, computer-readable storage medium and electronic device
CN114218302A (en) Information processing method, device, equipment and storage medium
CN116155541A (en) Automatic machine learning platform and method for network security application
CN113282433B (en) Cluster anomaly detection method, device and related equipment
CN114037059A (en) Pre-training model, model generation method, data processing method and data processing device
CN114896291A (en) Training method and sequencing method of multi-agent model
CN113590764A (en) Training sample construction method and device, electronic equipment and storage medium
CN111352820A (en) Method, equipment and device for predicting and monitoring running state of high-performance application
CN114511022B (en) Feature screening, behavior recognition model training and abnormal behavior recognition method and device
CN113051911B (en) Method, apparatus, device, medium and program product for extracting sensitive words
CN114090601B (en) Data screening method, device, equipment and storage medium
CN113612777B (en) Training method, flow classification method, device, electronic equipment and storage medium
CN114548307A (en) Classification model training method and device, and classification method and device
CN115408236A (en) Log data auditing system, method, equipment and medium
CN115619245A (en) Portrait construction and classification method and system based on data dimension reduction method
CN114021642A (en) Data processing method and device, electronic equipment and storage medium
CN114444514A (en) Semantic matching model training method, semantic matching method and related device
CN114492364A (en) Same vulnerability judgment method, device, equipment and storage medium
CN113297289A (en) Method and device for extracting business data from database and electronic equipment
CN113392920A (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN113110984B (en) Report processing method, report processing device, computer system and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination