CN111967575A

CN111967575A - Semi-automatic model updating system and model updating method

Info

Publication number: CN111967575A
Application number: CN202010711361.2A
Authority: CN
Inventors: 雷炳盛; 陈国庆; 谢强
Original assignee: Wuhan Jiyi Network Technology Co ltd
Current assignee: Wuhan Jiyi Network Technology Co ltd
Priority date: 2020-07-22
Filing date: 2020-07-22
Publication date: 2020-11-20

Abstract

The invention provides a semi-automatic model updating system and a model updating method. First, suspicious data in a standardized data set are periodically extracted and marked through strategy marking, Cluster model marking, manual analysis marking, and feature visualization marking based on the original CNN model, yielding a suspicious data set. The suspicious data set is then divided into patterns, and data with a low recognition degree are eliminated. Next, an end-to-end CNN model is trained and tested, and a model training report and a visual interface of the training process are exported. An engineer checks the recognition degree of the suspicious samples and the false-blocking rate of the positive samples (the rate at which normal samples are wrongly blocked) according to the model training report, so that model precision is continuously improved, the false-blocking rate is reduced, and an optimal new CNN model is obtained. The method acquires suspicious data from multiple angles and by multiple methods, and trains and updates the CNN model periodically, thereby gradually improving the coverage and recognition precision of the CNN model for abnormal-pattern data.

Description

Semi-automatic model updating system and model updating method

Technical Field

The invention belongs to the technical field of internet information security, and particularly relates to a semi-automatic model updating system and a model updating method.

Background

With the rapid development of internet information technology, the number of service scenarios that rely on internet technology keeps increasing. Internet applications bring convenience to users but also bring certain risks. To reduce the risk of service processing and improve the security and controllability of the internet, the server generally needs to perform risk identification on the current service processing based on a preset model during actual service processing.

Because internet applications are characterized by fast-changing scenarios and short update periods, an updated model is required to have high coverage and accuracy, and the model update itself must be efficient. Before a model is updated, it is usually trained and tested with training samples and test samples, and it is updated only after it passes the test. However, processing test samples with the trained model implicitly assumes that the training samples and the test samples are homogeneous. In many cases this assumption does not fully hold, so a model obtained from the training data set is not suitable for the data set to be tested, and the problem is particularly prominent when new data are generated. Labeling samples in the new data also takes a great deal of time and labor, so the cost of updating the model is high.

Chinese invention patent CN104699685B discloses a model updating device and method and a data processing device and method, which are used for updating a target model in a multi-model system. Each model in the multi-model system is trained and updated on the training data set and is then applied to the test data set to obtain pseudo labels; the feature distribution of the training data set is compared with that of the test data set, and the relevant parameters of the target model are continuously adjusted with the aim of minimizing the difference between the two distributions, thereby updating the model. With this method, the model can be updated at a lower cost and brought closer to real data, so that the performance of the system is improved. However, the following problems remain: the labels of the training data set and the data set to be tested are not obtained from the information in the data themselves, but the prediction results of a plurality of models are used as pseudo labels; the target model is optimized only by minimizing the difference between feature distributions calculated with multiple models, so the constraint conditions are insufficient; the misjudgment rate and the coverage rate of the model are not considered; a large amount of manual parameter adjustment is needed in the whole process; and the whole process considers only the homogeneity of the data and not their heterogeneity.

Invention patent CN109739869A discloses a method and a system for generating a model operation report. The method places some preset information of model training in a visual interface according to certain rules to generate the model operation report, so as to solve the technical problem that digital assets in the prior art cannot be uniformly managed and commercially used. However, the method mainly teaches how to visualize the content of model training on the interface and does not describe in detail the screening of the training and test data sets or the training, updating and monitoring of the model. Moreover, the visualization covers only the model part, and no corresponding checks are made on the data screening process. In the security field, collecting samples is time-consuming, laborious and expensive. In the early stage of model training, because abundant attack samples (namely negative samples) are lacking, the trained model can hardly reach the optimal level, and continuous iterative updating is required later.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a semi-automatic model updating system and a model updating method, so as to solve the problems of low coverage rate and low accuracy rate of model updating caused by low efficiency and accuracy rate of abnormal data collection in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme:

a semi-automatic model updating system comprises a data preprocessing module, a suspicious data extraction module, a CNN model training module and a CNN model updating module; the suspicious data extraction module comprises a strategy marking module, a Cluster model marking module, a manual marking module and a feature visualization marking module;

the data preprocessing module is used for preprocessing the original data to obtain a standardized data set so as to improve the calculation efficiency of the data;

the suspicious data extraction module is used for extracting and marking suspicious data in the standardized data set to obtain a suspicious data set; the strategy marking module is used for performing strategy marking on the suspicious data according to strategy rules; the Cluster model marking module automatically marks the suspicious data by using an online learning model; the manual marking module is used for manually performing anomaly analysis on the standardized data set and manually marking suspicious data; the feature visualization marking module is used for performing feature visualization marking on the missed suspicious data through the original CNN model;

the CNN model training module is used for training an original CNN model by taking the suspicious data set as a training sample to obtain a CNN model training report and a new CNN model;

and the CNN model updating module is used for an engineer to analyze the CNN model training report and determine whether to replace the original CNN model with the new CNN model.

Further, the Cluster model marking module automatically marks the suspicious data by using the Cluster model and automatically detects the misjudgment of the positive sample.

Further, the method by which the feature visualization marking module performs feature visualization marking on the suspicious data includes: performing feature visualization on the missed suspicious data through the original CNN model, eliminating data whose similarity to the positive sample data is higher than a preset threshold, retaining data whose similarity to the positive sample data is lower than the preset threshold, and marking the retained data as suspicious data.

Further, the suspicious data extraction module extracts and marks the suspicious data set in a regular mode and a non-regular mode; the period of the regular mode is 1-15 days, and the non-regular mode extracts and marks the suspicious data set at irregular times so as to cope with sudden attacks; when the suspicious data set is an empty set, the CNN model training module does not train the original CNN model.

Furthermore, the semi-automatic model updating system further comprises a suspicious data pattern division module, which is used for performing pattern division on the suspicious data set, eliminating suspicious samples with the identification degree of the original CNN model lower than a preset threshold value, and obtaining suspicious samples of a plurality of patterns so as to improve the identification degree of the suspicious samples in the original CNN model.

Further, the CNN model training module takes the suspicious samples of the plurality of patterns as training samples, and trains the original CNN model to obtain a model training report and a new CNN model.

Further, the updating method of the CNN model updating module includes:

(a) a data modeling engineer checks the recognition rate of the new CNN model to the suspicious samples of each mode and the recognition rate of the positive samples according to the model training report;

(b) then screening out suspicious samples with the recognition rate higher than a preset threshold value, taking the suspicious samples as training samples, and then training a CNN model to obtain a final version of the CNN model;

(c) and finally, testing the positive sample by using the final-version CNN model, and when the recognition rate of the final-version CNN model on the positive sample is more than 95%, putting the final-version CNN model online to replace the original CNN model.
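The release condition in step (c) can be illustrated with a minimal sketch. The function names and the boolean encoding of the test results are assumptions made for illustration only; the 95% threshold is the only value taken from the text.

```python
DEPLOY_THRESHOLD = 0.95  # from the text: recognition rate on positive samples must exceed 95%


def positive_recognition_rate(predictions):
    """predictions: True where the model correctly recognizes a positive (normal) sample."""
    return sum(predictions) / len(predictions) if predictions else 0.0


def should_replace_original(predictions, threshold=DEPLOY_THRESHOLD):
    """Gate for step (c): go online only when the rate is strictly above the threshold."""
    return positive_recognition_rate(predictions) > threshold


# Hypothetical test outcomes on 100 positive samples.
passing_run = [True] * 96 + [False] * 4
borderline_run = [True] * 95 + [False] * 5
```

With these hypothetical inputs, the first run clears the gate and the second does not, because the text requires the rate to be more than 95% rather than equal to it.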

Further, the process of training the original CNN model includes: applying for computer resources, importing the suspicious sample, preprocessing data in the suspicious sample, training the original CNN model, visualizing the training process, and exporting a model training report and a new CNN model file.

Further, the computer resources comprise a CPU, a GPU and memory; the training process is visualized with the TensorBoard tool.

A semi-automatic model updating method adopts the semi-automatic model updating system to update a model, and comprises the following steps:

S1, creating an original CNN model, and deploying the original CNN model online;

S2, carrying out standardization preprocessing on the original data to obtain a standardized data set;

S3, regularly extracting and marking suspicious data in the standardized data set through strategy marking, Cluster model marking, manual analysis marking and feature visualization marking of the original CNN model to obtain a suspicious data set;

S4, performing pattern division on the suspicious data set, and eliminating suspicious samples with the recognition degree of the original CNN model lower than a preset threshold value to obtain suspicious samples of a plurality of patterns;

S5, taking the suspicious samples as training samples, and training the original CNN model to obtain a model training report and a new CNN model; an engineer checks the recognition degree of the new CNN model for the suspicious samples and the false-blocking rate of the positive samples according to the model training report, then screens out the suspicious samples with the recognition degree lower than a preset threshold value, takes the suspicious samples as training samples, and trains the CNN model to obtain the final-version CNN model;

S6, testing the positive sample by using the final-version CNN model, and when the recognition rate of the final-version CNN model on the positive sample is more than 95%, putting the final-version CNN model online to replace the original CNN model.

Advantageous effects

Compared with the prior art, the semi-automatic model updating system and the model updating method provided by the invention have the following beneficial effects:

(1) The semi-automatic model updating system provided by the invention covers the whole process from data screening and feature visualization through model training to model evaluation and screening. In particular, in the data screening stage, abnormal data are drawn from a wide range of source channels and cover the attacker's current attack patterns as far as possible, so that attack data can be better identified during later CNN model training and the model's ability to identify abnormal data is improved as far as possible. Except for manual data analysis marking, all the suspicious data marking modes are computed automatically, so the marking accuracy and efficiency of the suspicious data are relatively high. The model training stage is fully automatic; in the CNN model updating stage, some manual participation in model evaluation and model screening is required. The whole process thus achieves semi-automatic model updating.

(2) The semi-automatic model updating system provided by the invention trains an end-to-end CNN model, which has strong learning and representation capability; the intermediate training process requires no manual participation and learning is automatic, so the model training stage is fully automatic. In addition, the training data set of the CNN model in the invention consists of positive and negative samples accumulated manually and continuously from the online service over a long period, so its accuracy is ensured, especially that of the positive sample set.

(3) In the semi-automatic model updating system provided by the invention, each round of model training can use the test set to verify the effect of the model, and a model updating report and a visual interface of the training process are exported. To assist the modeling engineer in evaluating the effect of the new CNN model, this part uses TensorBoard, with which the training process of the model can be observed directly online. Meanwhile, a training process log is recorded and serves as the model training report; a good report can be obtained after some of the modules are selected and organized in advance. The focus of the proposal is on abnormal data collection and new-version model updating, which form a complete process from original data to model output.

(4) According to the semi-automatic model updating system provided by the invention, the derived model training report can provide the misjudgment rate, coverage rate and precision of various positive and negative samples so as to more comprehensively reflect the effect of a new CNN model, and the visualization of the model training process can assist a data modeling engineer in making a decision. And adjusting the learning direction of the model in a targeted manner according to the prediction result of the new CNN model on each data file, finally finding an optimal model, and storing the model structure, the model parameters and the prediction effect of each data file.

Drawings

Fig. 1 is a block diagram illustrating a semi-automatic model updating system according to embodiment 1;

fig. 2 is a schematic diagram of a training process flow of a CNN model training module of the semi-automatic model updating system provided in the present invention;

fig. 3 is a schematic flow chart of an updating method of a CNN model training module of the semi-automatic model updating system according to the present invention;

fig. 4 is a block diagram showing a configuration of a semi-automatic model update system according to embodiment 2;

fig. 5 is a flow chart of a semi-automatic model updating method provided by the present invention.

Detailed Description

The technical solutions of the embodiments of the present invention will be described clearly and completely below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.

Example 1

Referring to fig. 1, a semi-automatic model updating system includes a data preprocessing module, a suspicious data extraction module, a CNN model training module, and a CNN model updating module; the suspicious data extraction module comprises a strategy marking module, a Cluster model marking module, a manual marking module and a feature visualization marking module.

The data preprocessing module is used for performing standardization preprocessing on the original data to obtain a standardized data set, so as to improve the computational efficiency of the data. Specifically: (1) the Schema of the data is determined and the main information is extracted, for example the device name and version and the browser name and version from the user agent, and track data are processed into a standard array; (2) the original row-oriented storage is converted into column-oriented storage, which avoids a large amount of unnecessary repeated parsing; (3) pre-aggregation: if features are calculated over the combination of ip and user agent, they can be calculated and stored in advance so that repeated calculation at a later stage is not needed; and so on.
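The three preprocessing steps above can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`ip`, `user_agent`, `track`), the `x,y;x,y` track encoding and all function names are hypothetical and are not specified by the invention.

```python
from collections import defaultdict

# Hypothetical row-oriented request records; field names and formats are assumptions.
rows = [
    {"ip": "10.0.0.1", "user_agent": "Mozilla/5.0 Chrome/84.0", "track": "1,2;3,4"},
    {"ip": "10.0.0.1", "user_agent": "Mozilla/5.0 Chrome/84.0", "track": "5,6"},
    {"ip": "10.0.0.2", "user_agent": "curl/7.68.0", "track": ""},
]


def parse_track(raw):
    """Step (1): turn an 'x,y;x,y' string into a standard list of (x, y) integer pairs."""
    if not raw:
        return []
    return [tuple(int(v) for v in point.split(",")) for point in raw.split(";")]


def to_columns(records):
    """Step (2): convert row-oriented records into column-oriented storage."""
    columns = defaultdict(list)
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return dict(columns)


def preaggregate(records):
    """Step (3): count requests per (ip, user_agent) pair once, for later reuse."""
    counts = defaultdict(int)
    for record in records:
        counts[(record["ip"], record["user_agent"])] += 1
    return dict(counts)


standardized = [dict(r, track=parse_track(r["track"])) for r in rows]
columns = to_columns(standardized)
pair_counts = preaggregate(standardized)
```

Storing `pair_counts` alongside the columnar data means any later feature that depends on the ip and user agent combination can be looked up rather than recomputed.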

And the suspicious data extraction module is used for extracting and marking the suspicious data in the standardized data set to obtain a suspicious data set. The method specifically comprises the following extraction and marking modes:

The strategy marking module is used for performing strategy marking on the suspicious data according to strategy rules. Long-term data analysis has accumulated a large set of policy rules that can discover, from multiple dimensions, whether data contain anomalies. The strategies are divided into single-item strategies and group strategies: for a single-item strategy, whether the strategy is triggered can be judged directly from one piece of data alone; for a group strategy, high-frequency behavior must be judged from the data of one period (usually one day). A mark is then attached to each piece of data, which completes the strategy marking task.
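The difference between single-item and group strategies can be illustrated with a short sketch. The two rules shown (an empty user agent, and more than three requests per ip in one period) and all names are hypothetical examples; the actual strategy rules are not disclosed in the text.

```python
from collections import Counter

# Hypothetical threshold, chosen only for illustration.
MAX_REQUESTS_PER_IP_PER_DAY = 3


def single_item_rule(record):
    """Single-item strategy: decided from one record alone (here: an empty user agent)."""
    return record["user_agent"] == ""


def group_rule_hits(records):
    """Group strategy: needs a whole period of data to detect high-frequency behavior."""
    per_ip = Counter(r["ip"] for r in records)
    return {ip for ip, n in per_ip.items() if n > MAX_REQUESTS_PER_IP_PER_DAY}


def strategy_mark(records):
    """Attach a 'strategy_suspicious' mark to every record of one period."""
    noisy_ips = group_rule_hits(records)
    return [
        dict(r, strategy_suspicious=single_item_rule(r) or r["ip"] in noisy_ips)
        for r in records
    ]


# One hypothetical day of data: a high-frequency ip, an empty user agent, a normal request.
day = (
    [{"ip": "1.1.1.1", "user_agent": "Chrome"}] * 4
    + [{"ip": "2.2.2.2", "user_agent": ""}]
    + [{"ip": "3.3.3.3", "user_agent": "Firefox"}]
)
marked = strategy_mark(day)
```

The single-item rule can run record by record as data arrive, whereas the group rule has to wait until the whole period is available.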

The Cluster model marking module is used for automatically marking the suspicious data by using the Cluster model and automatically detecting misjudgment of positive samples. The positive samples come from various sources, such as collection from specific websites (data without attacks), "honeypot" labeling and manual collection, and are usually accumulated over many years. The Cluster model can automatically screen and check suspicious data quickly and accurately, with high precision and efficiency. The model is an online dynamic model that can mark data in real time according to the online traffic and the degree of difference in feature distribution.

The manual marking module is used for manually performing anomaly analysis on the standardized data set and manually marking the suspicious data. A data analysis engineer performs in-depth anomaly analysis on the data and adds the suspicious-data marks from the analysis result to the original data.

The feature visualization marking module is used for performing feature visualization marking on the missed suspicious data through the original CNN model.

The method by which the feature visualization marking module performs feature visualization marking on the suspicious data is as follows: feature visualization is performed on the missed suspicious data through the original CNN model, the overlap between the positive samples and the suspicious data is monitored, and it is determined whether to add the suspicious data to the training set. Data whose similarity to the positive sample data is greater than a preset threshold are eliminated, and data whose similarity to the positive sample data is smaller than the preset threshold are retained and marked as suspicious data. The preset similarity threshold refers to the distribution difference between positive and negative samples in the last-layer features of the CNN, and the KL distance (KL divergence) is preferably used to measure the difference between the features. In practice, a KL value greater than 1 indicates that the two distributions are very different. Since a larger KL value indicates a smaller similarity, it is specified here that a KL value greater than 1 means the similarity is smaller than the preset threshold.
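The KL-based rule can be sketched on discrete histograms. Several points here are assumptions made for illustration: that the last-layer features are summarized as histograms over shared bins, that the divergence is taken from the candidate data to the positive samples, and that natural logarithms are used. Only the threshold value of 1 comes from the text.

```python
import math

KL_THRESHOLD = 1.0  # from the description: KL > 1 means the distributions are very different


def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def keep_as_suspicious(candidate_hist, positive_hist):
    """Keep a candidate batch only if its feature distribution is far from the positives."""
    kl = kl_divergence(normalize(candidate_hist), normalize(positive_hist))
    return kl > KL_THRESHOLD


positive = [70, 20, 8, 2]      # hypothetical feature histogram of positive samples
overlapping = [65, 25, 8, 2]   # close to the positives, so it would be eliminated
distinct = [2, 8, 20, 70]      # far from the positives, so it would be kept and marked
```

Because KL divergence is asymmetric, swapping the two arguments can change which batches cross the threshold; the direction used here is a choice of this sketch, not something the text specifies.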

Through the above several suspicious data extraction and marking modes, the source channel coverage of abnormal data is wider, and all attack modes of the current attacker are covered as much as possible, so that the attack data can be better identified during later CNN model training, and the abnormal data identification capability of the model is improved as much as possible. Besides the manual data analysis marking, other data marking modes of the suspicious data marking realize automatic calculation, so that the marking accuracy and efficiency of the suspicious data are relatively high.

And the CNN model training module is used for training the original CNN model by taking the suspicious data set as a training sample to obtain a model training report and a new CNN model. Referring to fig. 2, the process of training the original CNN model includes: applying for computer resources, importing the suspicious sample, training the original CNN model, visualizing the training process, and exporting a model training report and a new CNN model file.

The computer resources comprise a CPU, a GPU, memory and the like; the training process is visualized with the TensorBoard tool.

An end-to-end CNN model is adopted for training, which has strong learning and representation capability; the intermediate training process requires no manual participation and learning is automatic, so the model training stage is fully automatic. In each training round, the validation set is used to verify the effect of the model, the optimal model is selected automatically, and the model structure, the model parameters and the prediction effect on each data file are stored.
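The per-round validation and automatic selection of the optimal model can be sketched without a deep-learning framework. The `train_round` and `evaluate` callbacks, the dictionary standing in for a model, and the validation scores are all hypothetical placeholders; a real system would plug in its own CNN training and validation code.

```python
import copy


def train_with_selection(model, train_round, evaluate, rounds):
    """Run `rounds` training rounds and keep a snapshot of the best-scoring model."""
    best = {"round": None, "score": float("-inf"), "model": None}
    history = []
    for r in range(rounds):
        train_round(model)
        score = evaluate(model)
        history.append(score)
        if score > best["score"]:
            # Deep copy so that later rounds cannot alter the stored snapshot.
            best = {"round": r, "score": score, "model": copy.deepcopy(model)}
    return best, history


# Toy stand-ins so the sketch runs on its own.
toy_model = {"weight": 0.0}
toy_scores = iter([0.81, 0.93, 0.90])


def toy_train(model):
    model["weight"] += 1.0


def toy_eval(model):
    return next(toy_scores)


best, history = train_with_selection(toy_model, toy_train, toy_eval, rounds=3)
```

In this toy run the second round scores highest, so its snapshot is retained even though training continues for one more round.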

Referring to fig. 3, the CNN model updating module is configured to replace the original CNN model with the new CNN model. The updating method of the CNN model updating module comprises the following steps:

A model updating report and a visual interface of the training process are exported for each trained model. The model training report provides the misjudgment rate, the coverage rate and the precision for the various positive and negative samples, so as to reflect the effect of the new CNN model more comprehensively, and the visualization of the model training process can assist a data modeling engineer in making a decision. The learning direction of the model is adjusted in a targeted manner according to the prediction result of the new CNN model on each data file, and an optimal model is finally found. In the CNN model updating stage, some manual participation in model evaluation and model screening is required. Therefore, the whole model updating process realizes semi-automatic model updating.
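The three report figures can be illustrated with a small calculation. The definitions used here are assumptions consistent with the document's convention that positive samples are normal data and suspicious (negative) samples are attacks: coverage is the share of suspicious samples the model blocks, the misjudgment (false-blocking) rate is the share of positive samples it blocks, and precision is the share of blocked samples that are truly suspicious.

```python
def training_report(labels, flagged):
    """labels: True for suspicious (attack) samples, False for positive (normal) samples.
    flagged: True where the model blocks the sample."""
    true_hits = sum(1 for is_susp, hit in zip(labels, flagged) if is_susp and hit)
    false_hits = sum(1 for is_susp, hit in zip(labels, flagged) if not is_susp and hit)
    n_suspicious = sum(1 for is_susp in labels if is_susp)
    n_positive = len(labels) - n_suspicious
    blocked = true_hits + false_hits
    return {
        "coverage": true_hits / n_suspicious if n_suspicious else 0.0,
        "false_block_rate": false_hits / n_positive if n_positive else 0.0,
        "precision": true_hits / blocked if blocked else 0.0,
    }


# Hypothetical evaluation data: 4 suspicious samples followed by 6 positive samples.
labels = [True, True, True, True, False, False, False, False, False, False]
flagged = [True, True, True, False, True, False, False, False, False, False]
report = training_report(labels, flagged)
```

Computing the same report separately for each data file or sample pattern would give the per-category view that the engineer uses when deciding whether to accept the new model.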

In order to more accurately and effectively extract suspicious data, the suspicious data extraction module extracts and marks the suspicious data through a regular mode and an irregular mode, wherein the regular mode is used as a main mode, and the irregular mode is used as an auxiliary mode; the period of the periodic mode is 1-15 days, and preferably one week. The non-periodic mode is used for performing irregular extraction and marking on the suspicious data set so as to solve the problem of sudden attack. And when the suspicious data set extracted in each period is an empty set, the CNN model training module does not train the original CNN model.
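The interplay of the periodic mode, the non-periodic mode and the empty-set rule can be sketched as follows. The `burst_alert` flag as the trigger for the non-periodic mode and the function names are assumptions; the 7-day period is the preferred value from the text.

```python
from datetime import date, timedelta

PERIOD_DAYS = 7  # preferred period from the description; the allowed range is 1-15 days


def should_extract(today, last_run, burst_alert=False):
    """Periodic mode fires when the period has elapsed; non-periodic mode fires on an alert."""
    return burst_alert or (today - last_run) >= timedelta(days=PERIOD_DAYS)


def maybe_train(suspicious_set, train):
    """Skip training entirely when the extracted suspicious data set is empty."""
    if not suspicious_set:
        return None
    return train(suspicious_set)


last_run = date(2020, 7, 22)
```

For example, `should_extract(date(2020, 7, 25), last_run)` is false under the periodic mode alone, but passing `burst_alert=True` forces an extraction to respond to a sudden attack.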

Example 2

Referring to fig. 4, a semi-automatic model updating system differs from that in embodiment 1 in that the model updating system further includes a suspicious data pattern division module, configured to perform pattern division on the suspicious data set and remove suspicious samples for which the recognition degree of the original CNN model is lower than a preset threshold, so as to obtain suspicious samples of multiple patterns, improve the recognition degree of the suspicious samples in the original CNN model and reduce the risk of false blocking by the CNN model.
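Pattern division with threshold filtering can be sketched as below. How a pattern is assigned to a sample is not described in the text, so this sketch simply assumes each sample already carries a hypothetical `pattern` label and a `recognition` score from the original CNN model; the threshold value is also hypothetical.

```python
from collections import defaultdict

RECOGNITION_THRESHOLD = 0.5  # hypothetical preset threshold


def divide_by_pattern(samples, threshold=RECOGNITION_THRESHOLD):
    """Group suspicious samples by pattern, dropping those below the recognition threshold."""
    patterns = defaultdict(list)
    for sample in samples:
        if sample["recognition"] >= threshold:
            patterns[sample["pattern"]].append(sample)
    return dict(patterns)


# Hypothetical suspicious samples with pattern labels and recognition degrees.
samples = [
    {"id": 1, "pattern": "replay", "recognition": 0.9},
    {"id": 2, "pattern": "replay", "recognition": 0.2},
    {"id": 3, "pattern": "headless_browser", "recognition": 0.7},
]
by_pattern = divide_by_pattern(samples)
```

Each resulting group can then be passed to the CNN model training module as the suspicious samples of one pattern.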

Further, the CNN model training module takes the suspicious samples of the plurality of patterns as training samples, and trains the original CNN model by using a method substantially the same as that in embodiment 1 to obtain a model training report and a new CNN model.

Example 3

Referring to fig. 5, a semi-automatic model updating method for updating a model by using the semi-automatic model updating system of embodiment 1 includes the following steps:

S3, regularly (every 7 days) extracting and marking suspicious data in the standardized data set through strategy marking, Cluster model marking, manual analysis marking and feature visualization marking of the original CNN model respectively to obtain a suspicious data set;

S4, taking the suspicious data set as a training sample, and training the original CNN model to obtain a model training report and a new CNN model;

S5, a data modeling engineer checks the recognition degree of the new CNN model for the suspicious samples and the false-blocking rate of the positive samples according to the model training report, then screens out the suspicious samples with the recognition degree lower than a preset threshold value, takes the suspicious samples as training samples, and trains the CNN model to obtain the final CNN model;

Example 4

A semi-automatic model updating method for updating a model by using the semi-automatic model updating system described in embodiment 2, comprising the steps of:

S5, taking the suspicious samples as training samples, and training the original CNN model to obtain a model training report and a new CNN model; a data modeling engineer checks the recognition degree of the new CNN model for the suspicious samples and the false-blocking rate of the positive samples according to the model training report, then screens out the suspicious samples with the recognition degree lower than a preset threshold value, takes the suspicious samples as training samples, and trains the CNN model to obtain the final CNN model;

S6, testing the reference sample by using the finally obtained new CNN model, and when the average recognition rate of the new CNN model on the reference sample is more than 95%, putting the finally obtained new CNN model online to replace the original CNN model.

In summary, the present invention provides a complete model updating process, covering data screening, feature visualization, model training, and model evaluation and screening. In particular, in the data screening stage, abnormal data are drawn from a wide range of source channels and cover the attacker's current attack patterns as far as possible, so that attack data can be better identified during later CNN model training and the model's ability to identify abnormal data is improved as far as possible. Except for manual data analysis marking, all the suspicious data marking modes are computed automatically, so the marking accuracy and efficiency of the suspicious data are relatively high. The model training stage is fully automatic; in the CNN model updating stage, some manual participation in model evaluation and model screening is required. The whole process thus achieves semi-automatic model updating.

The above description covers only the preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A semi-automatic model updating system is characterized by comprising a data preprocessing module, a suspicious data extraction module, a CNN model training module and a CNN model updating module; the suspicious data extraction module comprises a strategy marking module, a Cluster model marking module, a manual marking module and a feature visualization marking module;

the suspicious data extraction module is used for extracting and marking suspicious data in the standardized data set to obtain a suspicious data set; the strategy marking module is used for automatically marking the suspicious data according to strategy rules; the Cluster model marking module automatically marks the suspicious data by using an online learning model; the manual marking module is used for manually performing anomaly analysis on the standardized data set and manually marking suspicious data; the feature visualization marking module is used for performing feature visualization marking on the missed suspicious data through the original CNN model;

2. The semi-automatic model updating system of claim 1, wherein the Cluster model marking module automatically marks the suspicious data by using a Cluster model and automatically detects the misjudgment of the positive sample.

3. The semi-automatic model updating system of claim 1, wherein the method by which the feature visualization marking module performs feature visualization marking on the suspicious data comprises: performing feature visualization on the missed suspicious data through the original CNN model, eliminating data whose similarity to the positive sample data is higher than a preset threshold, retaining data whose similarity to the positive sample data is lower than the preset threshold, and marking the retained data as suspicious data.

4. The semi-automatic model updating system of claim 1, wherein the suspicious data extraction module extracts and marks the suspicious data set in a regular mode and a non-regular mode; the period of the regular mode is 1-15 days, and the non-regular mode extracts and marks the suspicious data set at irregular times so as to cope with sudden attacks; and when the suspicious data set is an empty set, the CNN model training module does not train the original CNN model.

5. The semi-automatic model updating system according to any one of claims 1 to 4, further comprising a suspicious data pattern classification module, configured to perform pattern classification on the suspicious data set, and eliminate suspicious samples with the identification degree of the original CNN model lower than a preset threshold, so as to obtain suspicious samples with several patterns, so as to improve the identification degree of the suspicious samples in the original CNN model.

6. The semi-automated model updating system of claim 5, wherein the CNN model training module takes the suspicious samples of the plurality of patterns as training samples to train the original CNN model, so as to obtain a model training report and a new CNN model.

7. The semi-automated model updating system of claim 6, wherein the updating method of the CNN model updating module comprises:

8. The semi-automated model updating system of claim 6, wherein the process of training the original CNN model comprises: applying for computer resources, importing the suspicious sample, preprocessing data in the suspicious sample, training the original CNN model, visualizing the training process, and exporting a model training report and a new CNN model file.

9. A semi-automatic model updating system according to claim 8, wherein the computer resources comprise a CPU, a GPU and memory; the training process is visualized with the TensorBoard tool.

10. A semi-automatic model updating method, wherein the semi-automatic model updating system of any one of claims 1 to 9 is used for model updating, and the method comprises the following steps: