CN115310091A - Target security level identification method and device based on fusion model and electronic equipment - Google Patents


Info

Publication number
CN115310091A
CN115310091A (application CN202210786181.XA)
Authority
CN
China
Prior art keywords
training, fusion, models, model, sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210786181.XA
Other languages
Chinese (zh)
Inventor
宋孟楠 (Song Mengnan)
王磊 (Wang Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qiyue Information Technology Co Ltd
Original Assignee
Shanghai Qiyue Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qiyue Information Technology Co Ltd filed Critical Shanghai Qiyue Information Technology Co Ltd
Priority to CN202210786181.XA
Publication of CN115310091A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities

Abstract

The invention discloses a target security level identification method and apparatus based on fusion models, and an electronic device. The method comprises the following steps: preprocessing historical target data to obtain a training set; dividing the training set into a plurality of groups of training subsets; for each training subset, training a corresponding fusion model using a plurality of different models and fusion modes; inputting the target data to be identified into each trained fusion model to obtain a plurality of initial security levels of the target; and performing fusion calculation on the plurality of initial security levels to obtain the security level of the target. By dividing the training set into multiple groups of training subsets and training each subset with a plurality of different models and fusion modes, the invention improves the accuracy of the initial security levels of the target; fusing the multiple initial security levels on this basis strengthens the protection of data security on the platform, avoids data leakage, and improves the platform's data security.

Description

Target security level identification method and device based on fusion model and electronic equipment
Technical Field
The invention relates to the field of data processing, in particular to a target security level identification method and device based on a fusion model, electronic equipment and a computer readable medium.
Background
With the development of the internet, various internet service platforms have emerged, such as online shopping platforms, ride-hailing platforms, sharing platforms, maps, and music services. These platforms bring great convenience to people's lives, but because they are tightly coupled with the internet, they carry hidden security risks such as fraud and credit default. Security identification of devices is therefore particularly important for an internet service platform.
Currently, devices are identified by tree-based ensemble models such as XGBoost, LightGBM, and GBDT to determine whether they are secure. During training, these models are limited by factors such as incomplete data collection and missing scenario data, so part of the training data is usually missing. At present, the missing values are typically filled with a preset default value, and the tree model is trained on the default-filled training set. Because the default value usually differs greatly from the real value and cannot reflect it well, this training approach degrades the ranking ability of the model, so that the security identification of devices is not accurate enough and a risk of data leakage remains in the internet service platform.
Disclosure of Invention
In view of the above, the present invention is directed to a method, an apparatus, an electronic device and a computer-readable medium for identifying a target security level based on a fusion model, so as to at least partially solve at least one of the above technical problems.
In order to solve the above technical problem, a first aspect of the present invention provides a target security level identification method based on a fusion model, where the method includes:
preprocessing historical target data to obtain a training set;
dividing the training set into a plurality of sets of training subsets;
for each training subset, training a corresponding fusion model by adopting a plurality of different models and fusion modes;
respectively inputting target data to be recognized into each trained fusion model to obtain a plurality of initial security levels of the target to be recognized;
and performing fusion calculation on the plurality of initial security levels of the target to be recognized to obtain the security level of the target to be recognized.
According to a preferred embodiment of the present invention, for each training subset, training a corresponding fusion model using a plurality of different models and fusion methods includes:
aiming at each training subset, training by adopting different types of models and/or models with the same type and different parameters and a fusion mode; wherein, the models and the fusion modes adopted by the training subsets divided by the same training set are different.
According to a preferred embodiment of the present invention, the training of the corresponding fusion model using a plurality of different models and fusion modes includes:
dividing the training subset into a training sub data set and a verification sub data set;
training a plurality of models of different classes through the training subdata set and the verification subdata set to respectively obtain a plurality of groups of sub-fusion models;
carrying out model fusion on the multiple sub-fusion models to obtain corresponding fusion models;
and/or, the training of the corresponding fusion model by adopting a plurality of different models and fusion modes comprises the following steps:
generating a first model of a plurality of different parameters;
respectively inputting the training subsets into a plurality of first models for training to obtain a plurality of groups of sub-fusion models and output results output by each sub-fusion model;
carrying out mean value fusion on the plurality of groups of sub-fusion models and corresponding output results to obtain corresponding fusion models;
and/or, the training of the corresponding fusion model by adopting a plurality of different models and fusion modes comprises:
dividing the training subset into N sub-datasets;
respectively inputting the N sub-data sets into N second models with different parameters for training to obtain N groups of sub-fusion models and N output results output by each sub-fusion model;
and carrying out average fusion on the N groups of sub-fusion models and the corresponding N output results to obtain corresponding fusion models.
According to a preferred embodiment of the present invention, before the training of the corresponding fusion model using the plurality of different models and fusion modes, the method further includes:
determining a correlation coefficient of each model;
and taking the models whose correlation coefficients are greater than the threshold value as the models of different classes and/or the models of the same class with different parameters.
According to a preferred embodiment of the present invention, the dividing the training set into a plurality of training subsets comprises:
carrying out balancing treatment on the training set;
and extracting samples from the training set after the balancing treatment for multiple times to respectively form a training subset.
According to a preferred embodiment of the present invention, the preprocessing the historical target data to obtain the training set includes:
performing data cleaning on historical target data to obtain a wide-table variable;
and performing descriptive exploration analysis on the wide-table variable, and screening out characteristic data according to an analysis result to obtain a training set.
In order to solve the above technical problem, a second aspect of the present invention provides a target security level identification apparatus based on a fusion model, the apparatus comprising:
the preprocessing module is used for preprocessing historical target data to obtain a training set;
a dividing module for dividing the training set into a plurality of groups of training subsets;
the training module is used for training the corresponding fusion model by adopting a plurality of different models and fusion modes for each training subset;
the input module is used for respectively inputting target data to be recognized into each trained fusion model to obtain a plurality of initial security levels of the target to be recognized;
and the fusion module is used for performing fusion calculation on the plurality of initial security levels of the target to be recognized to obtain the security level of the target to be recognized.
According to a preferred embodiment of the present invention, the training module is configured to perform training by using different types of models and/or models with the same type and different parameters and a fusion method for each training subset; wherein, the models and the fusion modes adopted by the training subsets divided by the same training set are different.
According to a preferred embodiment of the invention, the training module comprises:
a first partitioning module that partitions the training subset into a training sub-data set and a validation sub-data set;
the first training module is used for training a plurality of models of different categories through the training subdata set and the verification subdata set to respectively obtain a plurality of groups of sub-fusion models;
the first fusion module is used for carrying out model fusion on the multiple sub-fusion models to obtain corresponding fusion models;
and/or, the training module comprises:
a generation module for generating a first model of a plurality of different parameters;
the second training module is used for inputting the training subsets into a plurality of first models respectively for training to obtain a plurality of groups of sub-fusion models and output results output by each sub-fusion model;
the second fusion module is used for carrying out mean value fusion on the plurality of groups of sub-fusion models and the corresponding output results to obtain corresponding fusion models;
and/or, the training module comprises:
a third dividing module, configured to divide the training subset into N sub-data sets;
the third training module is used for inputting the N sub-data sets into N second models with different parameters respectively for training to obtain N groups of sub-fusion models and N output results output by each sub-fusion model;
and the third fusion module is used for carrying out average fusion on the N groups of sub-fusion models and the corresponding N output results to obtain corresponding fusion models.
According to a preferred embodiment of the invention, the device further comprises:
the determining module is used for determining the correlation coefficient of each model;
and the screening module is used for taking the models whose correlation coefficients are greater than the threshold value as the models of different categories and/or the models of the same category with different parameters.
According to a preferred embodiment of the present invention, the dividing module includes:
the balancing processing module is used for carrying out balancing processing on the training set;
and the sampling module is used for extracting samples from the training set after the balancing treatment for multiple times to respectively form a training subset.
According to a preferred embodiment of the present invention, the preprocessing module includes:
the cleaning module is used for cleaning the historical target data to obtain a wide-table variable;
and the analysis screening module is used for performing descriptive exploration analysis on the wide-table variables, screening out characteristic data according to the analysis result and obtaining a training set.
To solve the above technical problem, a third aspect of the present invention provides an electronic device, comprising:
a processor; and
a memory storing computer executable instructions that, when executed, cause the processor to perform the method described above.
To solve the above technical problems, a fourth aspect of the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs which, when executed by a processor, implement the above method.
The invention divides a training set into multiple groups of training subsets and trains each subset with a plurality of different models and fusion modes. The resulting fusion models accommodate both the data characteristics of the different training subsets and, through the different fusion modes, the ranking prediction capabilities of the different models, improving the accuracy of the initial security levels of the target. Fusing the multiple initial security levels on this basis makes the final security level of the target more accurate, strengthens the protection of data security in the internet service platform, and effectively avoids the data leakage that arises in the related art when the determined security level of a user target is not accurate enough.
Drawings
In order to make the technical problems solved, the technical means adopted, and the technical effects obtained by the present invention clearer, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings described below depict only exemplary embodiments of the invention, from which those skilled in the art can derive other embodiments without inventive effort.
FIG. 1 is a schematic flow chart of a target security level identification method based on a fusion model according to an embodiment of the present invention;
FIG. 2 shows a training set D according to an embodiment of the present invention 0 Divided into three training subsets D 1 、D 2 、D 3 For each training subset, a schematic diagram of a corresponding fusion model is trained by adopting a plurality of different models and fusion modes;
FIG. 3 is a schematic diagram of a fusion model obtained by training SVM models with a plurality of different parameters according to an embodiment of the present invention;
FIG. 4 is a schematic structural framework diagram of a target security level recognition apparatus based on a fusion model according to an embodiment of the present invention;
FIG. 5 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention;
FIG. 6 is a schematic diagram of one embodiment of a computer-readable medium of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The invention may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus a repetitive description thereof may be omitted hereinafter. It will also be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms; the terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart of a target security level identification method based on fusion models according to the present invention. As shown in fig. 1, the method includes:
s1, preprocessing historical target data to obtain a training set;
In this embodiment, the target may be a terminal device or a server, in which case the target data may be basic information of the terminal device or server, such as log information, data transmission information, and device status information. The target may also be a user using the terminal device or server, in which case the target data may include one or more of the following public information: network name, the user's place of origin, time of last login, publicly released content, historical user behavior information, and the region where the device is located. The target information may include a target model, a target ID, and the like; the behavior information may include purchasing behavior, usage behavior, rental behavior, business browsing duration, and the like; and the tracking-point ("buried point") behavior information may include click behavior, browsing behavior, and the like.
In addition, the data processing of this scheme may be performed only on user information that cannot identify the user, thereby protecting user privacy. This can be achieved by deleting identifying information from the user information or by anonymizing it, where anonymization may process the data by means of encryption.
To ensure the stability and accuracy of the trained model, this embodiment preprocesses the collected historical target data, for example by data cleaning, data integration, data transformation, data classification, and variable screening. Illustratively, this step may include:
s11, carrying out data cleaning on historical target data to obtain a wide-table variable;
In this embodiment, duplicate information and erroneous information are deleted through data cleaning, improving data consistency. Exemplary data cleaning includes, but is not limited to: variable missing-rate analysis (for example, keeping features whose missing rate is below 95%), outlier handling (for example, capping, outlier removal, and the like), text variable processing, and so on. The cleaned historical target data is stored in a single data table, yielding wide-table variables of tens of thousands of dimensions, which can improve the efficiency of iterative computation during model training.
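Illustratively, the cleaning step above can be sketched in Python (not part of the patent; a pandas-based sketch in which the 95% missing-rate cutoff and the quantile used for capping are illustrative assumptions, and the function name is hypothetical):

```python
import numpy as np
import pandas as pd

def clean_wide_table(df, max_missing=0.95, cap_quantile=0.99):
    """Illustrative cleaning: drop duplicate rows, drop columns whose
    missing rate exceeds max_missing, and cap numeric outliers at a
    high quantile (a simple 'capping' treatment)."""
    df = df.drop_duplicates()
    keep = [c for c in df.columns if df[c].isna().mean() <= max_missing]
    df = df[keep].copy()
    for c in df.select_dtypes(include=[np.number]).columns:
        df[c] = df[c].clip(upper=df[c].quantile(cap_quantile))
    return df
```

The cleaned frame then serves as the wide table from which features are screened in the next step.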
And S12, performing descriptive exploration analysis on the wide-table variables, and screening out characteristic data according to analysis results to obtain a training set.
In this embodiment, the descriptive exploratory analysis may include central tendency analysis, dispersion analysis, and distribution shape analysis of the wide-table variables, with the feature data screened out according to the results of these analyses to obtain the training set. Central tendency measures include the mean, median, and mode; dispersion measures include the range, standard deviation, coefficient of variation, percentiles, interquartile range, and variance; distribution shape measures include skewness and kurtosis.
In this step, the descriptive exploratory analysis may also include: analyzing the discrimination (KS) of each variable against the target variable, combined with the variable's coverage, feature-value concentration, target-variable correlation, and significance, to obtain a first analysis result; analyzing the information value (IV) of each variable in the same way to obtain a second analysis result; and analyzing the feature importance ranking of tree ensemble models (such as XGBoost and RF) in the same way to obtain a third analysis result. Combining the first, second, and third analysis results, features (e.g., 200 of them) whose coverage is above a threshold and whose discrimination of the target variable is above a preset level are screened out of the wide-table variables to form the training set.
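The KS-based discrimination screen can be sketched as follows (a minimal numpy illustration, not the patent's implementation; the threshold value and column indexing are assumptions, and the IV and tree-importance screens would be applied analogously):

```python
import numpy as np

def ks_statistic(scores, labels):
    """KS discrimination of one variable against a binary target: the
    maximum gap between the positive- and negative-class CDFs when the
    samples are ordered by the variable's value."""
    order = np.argsort(scores)
    labels = np.asarray(labels)[order]
    pos_cdf = np.cumsum(labels) / labels.sum()
    neg_cdf = np.cumsum(1 - labels) / (1 - labels).sum()
    return float(np.max(np.abs(pos_cdf - neg_cdf)))

def screen_by_ks(X, y, threshold=0.5):
    """Keep the indices of columns whose KS exceeds the threshold."""
    return [j for j in range(X.shape[1]) if ks_statistic(X[:, j], y) > threshold]
```

A variable that separates the classes perfectly has KS = 1.0, while an uninformative one stays near 0, so thresholding KS discards weakly discriminating features.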
Further, to verify the effect of the fusion model, before this step the historical target data may be divided into a training set and a test set according to the chronological order of the data, for example: historical user data from Jan. 30, 2020 to Jan. 30, 2021 as the training set, and historical user data from Feb. 1, 2021 to May 1, 2021 as the test set, with the effect of the fusion model verified on the test set.
S2, dividing the training set into a plurality of groups of training subsets;
In this step, the training set may be directly divided into a plurality of training subsets by sample count, for example: evenly splitting the samples of the training set into several groups of training subsets; or randomly drawing samples from the training set to form each training subset; or drawing samples from the training set according to a preset rule to form each training subset.
In this embodiment, in order to ensure that the positive and negative samples in the training set are distributed in a balanced manner, the training set needs to be balanced first, and then training subsets are formed from multiple samples in the training set after the balancing process, so that the training set is divided into multiple groups of training subsets. Illustratively, a combined inheritance method can be adopted to balance the training set; the combination inheritance method realizes inheritance of prototype properties and methods by using a prototype chain, and realizes inheritance of instance properties by using constructor inheritance in a matching manner, so that data imbalance caused by sharing one piece of data by a plurality of instances for data of a reference type is avoided. In extracting the training subsets, an equal number of samples may be randomly extracted to constitute the training subsets. Such as: the training set has 150 ten thousand samples, and the samples are drawn in three times, so that each sample subset consists of 50 ten thousand samples drawn randomly.
S3, for each training subset, training a corresponding fusion model by adopting a plurality of different models and fusion modes;
in this embodiment: models and fusion modes adopted by training subsets divided from the same training set are different from each other, so that the trained fusion models can be compatible with data characteristics of different training subsets and sequencing prediction capabilities of different models based on different fusion modes.
Wherein: the plurality of different models used for the same training subset may be models of different types (for example, an XGBoost model and an LR model), models of the same type with different parameters (for example, logistic regression models with different parameters), or a mix of the two (for example, an XGBoost model, an LR model with parameter set A, and an LR model with parameter set B). The fusion mode is used to fuse the different models of a training subset so that the advantages of the different models are exploited. Fusion modes may include: mean fusion, model fusion, cross fusion, feature fusion, and the like. In practice, different models and fusion modes can be combined freely, so long as each training subset is trained with a different combination; the number of different models adopted by each training subset can be flexibly configured as needed.
Illustratively, if a training subset is trained using models of different classes, the training of the corresponding fusion model using a plurality of different models and fusion modes includes:
s31, dividing the training subset into a training subdata set and a verification subdata set;
training a plurality of different models by adopting a cross validation method to respectively obtain a plurality of groups of output results;
To obtain a stable and reliable model, this embodiment trains the plurality of different models using cross-validation. In cross-validation, the training subset is divided into n data sets; at each training round one data set is held out as the test set and the other n-1 data sets are used for training the model and tuning its parameters, so that n models are trained across the n data sets. The value of n may be set to the number of models to be trained (for example, 5 different models gives n = 5); alternatively, n may be fixed at a preset value such as n = 5, in which case 5-fold cross-validation is used regardless of the number of models.
For example, as shown in fig. 2, taking four different models XGBoost, GBDT (gradient boosting decision tree), ExtraTrees (extremely randomized trees), and RandomForest as an example, each model is trained by 5-fold cross-validation; after training is completed, the training subsets are respectively input into the four trained models to obtain four groups of output results.
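A scikit-learn sketch of this step (illustrative only: GradientBoostingClassifier stands in for XGBoost/GBDT so the example is self-contained, and synthetic data replaces a real preprocessed training subset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for one training subset.
X, y = make_classification(n_samples=300, random_state=0)

base_models = [
    GradientBoostingClassifier(random_state=0),        # stand-in for XGBoost/GBDT
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# 5-fold cross-validation: each sample's score comes from a model that
# never saw it during training, giving one output column per base model.
oof_outputs = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
```

The per-model output columns are exactly the "groups of output results" that the subsequent fusion step consumes.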
S32, training a plurality of models of different types through the training subdata set and the verification subdata set to respectively obtain a plurality of groups of sub-fusion models;
s33, model fusion is carried out on the multiple sub-fusion models to obtain corresponding fusion models;
In this example, the model fusion may be: performing weight training on the output results of the multiple groups of sub-fusion models with an LR model, thereby overcoming the unbalanced weight distribution caused by manually preset weights.
The multiple groups of sub-fusion models trained in step S32 and the LR model trained in step S33 together constitute the corresponding fusion model.
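The LR-based fusion of S32-S33 can be sketched as follows (a hedged illustration with scikit-learn stand-ins and synthetic data; the point is that the LR meta-model learns the fusion weights instead of using manually preset ones):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for one training subset.
X, y = make_classification(n_samples=300, random_state=0)

sub_models = [RandomForestClassifier(n_estimators=50, random_state=0),
              ExtraTreesClassifier(n_estimators=50, random_state=0)]

# Cross-validated outputs of the sub-fusion models, one column each.
sub_outputs = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in sub_models
])

# The LR meta-model is trained on those outputs: its coefficients act
# as the learned fusion weights.
meta = LogisticRegression().fit(sub_outputs, y)
learned_weights = meta.coef_[0]
```

At prediction time, the sub-models score the input and the LR layer combines their scores with the learned weights.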
In an example, if a training subset is trained using models of the same type and different parameters, the training of the corresponding fusion model using a plurality of different models and fusion modes includes:
s301, generating a plurality of first models with different parameters;
wherein: the first model may be configured as desired; for example, the first model may employ the XGBoost model. Illustratively, a plurality of different XGBoost models may be generated through parameter perturbation, i.e., training models with larger differences by randomly setting different parameters.
S302, inputting the training subsets into a plurality of first models respectively for training to obtain a plurality of groups of sub-fusion models and output results output by each sub-fusion model;
s303, carrying out mean value fusion on the plurality of sub-fusion models and the corresponding output results to obtain corresponding fusion models;
In this example, the fusion mode adopts mean fusion, that is: the multiple groups of sub-fusion models are given different weights according to a preset algorithm, and the output results corresponding to the groups of sub-fusion models are weighted by those weights.
The multiple groups of sub-fusion models with different parameters trained in step S302 and the model weights obtained in step S303 together constitute the corresponding fusion model.
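The parameter-perturbation route of S301-S303 can be sketched as follows (illustrative only: GradientBoostingClassifier stands in for XGBoost, the perturbation ranges are assumptions, and equal weights represent one possible preset weighting scheme):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for one training subset.
X, y = make_classification(n_samples=300, random_state=0)

# Parameter perturbation: same model class, randomly varied hyper-parameters.
perturbed = [
    GradientBoostingClassifier(
        n_estimators=int(rng.integers(40, 120)),
        max_depth=int(rng.integers(2, 5)),
        learning_rate=float(rng.uniform(0.05, 0.3)),
        random_state=0,
    ).fit(X, y)
    for _ in range(4)
]

outputs = np.column_stack([m.predict_proba(X)[:, 1] for m in perturbed])
weights = np.full(4, 0.25)        # equal weights; any preset scheme fits here
fused_score = outputs @ weights   # weighted mean fusion of the four outputs
```

The four perturbed models plus the weight vector together play the role of fusion model M1 in fig. 2.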
In another example, if a training subset is trained using models of the same type and different parameters, the training of the corresponding fusion model using a plurality of different models and fusion modes includes:
s311, dividing the training subset into N sub-data sets;
wherein: the value of N is the same as the number of different models to be trained, for example: subset D2 may be divided by Bootstrap sampling with replacement into 5 sub-datasets D21, D22, …, D25, corresponding to 5 different models.
S312, inputting the N sub-data sets into N second models with different parameters respectively for training to obtain N groups of sub-fusion models and N output results output by each sub-fusion model;
wherein: the category of the second model differs from that of the first model; for example, the second model may adopt an SVM model. The specific training process may refer to fig. 3: after each sub-dataset is used to train its second model, yielding the respective sub-fusion models, any sub-dataset may be used as a test set to evaluate the sub-fusion models.
S313, carrying out average fusion on the N groups of sub-fusion models and the corresponding N output results to obtain corresponding fusion models.
In this example, the fusion mode is an average fusion, that is, the output results of all N groups of sub-fusion models are averaged.
The sub-fusion models with different parameters trained in step S312 and the model mean value trained in step S313 are used together to generate a corresponding fusion model.
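The S311-S313 route can be sketched as follows (illustrative only: the gamma values are assumed perturbed parameters, Bootstrap resampling supplies the N sub-datasets, and synthetic data replaces a real training subset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for training subset D2.
X, y = make_classification(n_samples=200, random_state=0)

N = 5
gammas = [0.005, 0.01, 0.05, 0.1, 0.5]   # five SVMs with different parameters
svms = []
for g in gammas:
    # Bootstrap: sample with replacement to form one sub-dataset D2i.
    idx = rng.choice(len(y), len(y), replace=True)
    svms.append(SVC(gamma=g, probability=True, random_state=0).fit(X[idx], y[idx]))

# Average fusion: plain mean of the N output results.
avg_score = np.mean([m.predict_proba(X)[:, 1] for m in svms], axis=0)
```

The five SVMs plus the averaging step together play the role of fusion model M2 in fig. 2.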
In FIG. 2, the training set D0 is divided into three training subsets D1, D2, D3, wherein: training subset D1 trains 4 XGBoost models with different parameters in the manner of S301-S303, obtaining fusion model M1 after mean fusion; training subset D2 trains 5 SVM models with different parameters in the manner of S311-S313, obtaining fusion model M2 after averaging; and training subset D3 trains the XGBoost, GBDT, ExtraTrees, and RandomForest models in the manner of S31-S32, obtaining fusion model M3 after LR model fusion.
In this embodiment, the smaller the correlation among the multiple different models used by each training subset, the better the ranking effect of the trained fusion model. Therefore, for each training subset, before the corresponding fusion model is trained using the plurality of different models and fusion modes, the models may be screened: first, the correlation coefficients of the different models are determined; then, the models whose correlation coefficients are larger than the threshold value are taken as the final different models. Wherein: the correlation coefficient of a model may be evaluated by cosine similarity or the Pearson correlation coefficient.
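One way to obtain such a coefficient is the Pearson correlation between two models' score vectors on a shared sample set; a small self-contained sketch (the function name is hypothetical):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two models' output scores on
    # the same samples: covariance divided by the product of the standard
    # deviations. Values near 0 indicate weakly related models.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Cosine similarity, the other measure mentioned, is the same computation without the mean-centering step.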
S4, respectively inputting target data to be recognized into each trained fusion model to obtain a plurality of initial security levels of the corresponding target to be recognized;
in order to ensure the recognition effect of the fusion models, the target data to be recognized may be preprocessed in advance as in step S1, and the preprocessed target data input into each trained fusion model (for example, the fusion models M1, M2 and M3 in fig. 2). Each fusion model outputs an initial security level of the target to be recognized: for example, fusion model M1 outputs the initial security level m1, fusion model M2 outputs m2, and fusion model M3 outputs m3.
In this embodiment, the security level of the target to be recognized may be a specific score value, or may be a category value divided based on the score value, which is not specifically limited in the present invention.
And S5, performing fusion calculation on the plurality of initial security levels of the target to be recognized to obtain the security level of the target to be recognized.
As shown in fig. 2, a second-layer model M is formed by performing fusion calculation on the plurality of initial security levels of the target to be recognized. Wherein: the fusion calculation for the plurality of initial security levels may include mean fusion, model fusion, cross fusion, feature fusion, and the like. Illustratively, mean fusion may be applied to m1, m2 and m3 to obtain the target security level m+.
In this embodiment, the identification capability of each fusion model is determined according to its identification results for the training samples in the corresponding training subset; it may be measured by the model's AUC (Area Under Curve), defined as the area enclosed between the ROC curve (Receiver Operating Characteristic curve) and the coordinate axis. The features of the training samples in each training subset are ranked according to the feature quantity ratio and the ranking of the corresponding model, and each feature is scored: the higher a feature's quantity-ratio ranking, the higher its score; when quantity-ratio rankings are equal, the higher the ranking of the corresponding model, the higher the score. Finally, each feature's scores under the different training subsets are summed, and the importance ranking of the features is determined according to their total scores.
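The AUC referred to here equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, which gives a compact way to compute it without plotting the ROC curve. A sketch (the function name is hypothetical):

```python
def auc(labels, scores):
    # AUC as a pairwise comparison: count positive/negative score pairs
    # where the positive wins (ties count half), normalized by all pairs.
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a model that ranks every positive above every negative, which is why table 1 compares models on this metric.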
In this embodiment, when performing fusion calculation on the initial security levels, the feature quantities of the training subsets used to train the fusion model behind each initial level may be obtained; the feature quantities of each fusion model's training subset are weighted according to the feature quantity ratio and the total score of each feature to obtain the weight of that fusion model, and the security level of the target to be identified is then obtained by weighting each fusion model's initial security level with its weight. Alternatively, the feature with the largest quantity may be taken as the target feature of each fusion model, a ranking weight determined for each fusion model from the position of its target feature in the feature importance ranking, and the security level of the target to be identified obtained by weighting the initial security levels with these ranking weights. Or the total score of each fusion model's target feature may be used directly as its weight in the weighted calculation with the corresponding initial security level.
For example, training subset A includes 10 samples, whose features total: 20 of feature 1, 15 of feature 2 and 10 of feature 3; training subset B includes 10 samples, whose features total: 10 of feature 1, 15 of feature 2, 20 of feature 3 and 1 of feature 4; training subset C includes 10 samples, whose features total: 15 of feature 1, 20 of feature 2 and 10 of feature 3. The AUC of the model corresponding to training subset A ranks first, that of training subset B second, and that of training subset C third. Scoring the features of each training subset may then give, for example: training subset A: feature 1: 0.3 points, feature 3: 0.2 points; training subset B: feature 3: 0.4 points, feature 2: 0.25 points, feature 1: 0.15 points, with feature 4 scored likewise; training subset C: feature 2, feature 1 and feature 3 scored in turn. The total scores are then: feature 1: 0.85 points, feature 2: 0.85 points, feature 3: 0.7 points, feature 4: 0.1 points. This is only an example; in practical application the magnitudes of the training subsets and of the features are far larger, which this embodiment does not specifically limit.
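Summing the per-subset feature scores into totals, and then using a total score as a fusion weight, can be sketched as follows. The dictionaries and values are illustrative only and do not reproduce the patent's exact scoring rule; the function names are hypothetical:

```python
def feature_totals(per_subset_scores):
    # Sum each feature's score across the training subsets to get the
    # totals from which the importance ranking is read off.
    totals = {}
    for scores in per_subset_scores.values():
        for feature, score in scores.items():
            totals[feature] = totals.get(feature, 0.0) + score
    return totals

def weighted_security_level(initial_levels, weights):
    # Weighted fusion of the fusion models' initial security levels,
    # e.g. using each model's target-feature total score as its weight.
    total = sum(weights)
    return sum(level * w for level, w in zip(initial_levels, weights)) / total
```

A model whose training subset is dominated by high-scoring features thereby contributes more to the final security level than one trained mostly on low-scoring features.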
In order to compare the effect of the models obtained by this training mode on the same data set, this embodiment divides the historical target data into a training set and a test set in advance according to the time order in which the data was produced, trains each fusion model through steps S2 and S3 using the training set, and evaluates each trained model's indices on the test set. Wherein: table 1 shows the evaluation results of AUC, KS and Top5-Lift for each model. The model M+ in table 1 comprises the fusion models M1, M2 and M3 together with the second-layer model M that fuses the initial target security levels output by M1, M2 and M3.
           XGBoost  M1      M2      M3      M+
AUC        0.6612   0.6638  0.6548  0.6631  0.6680
KS         0.2312   0.2343  0.2243  0.2342  0.2386
Top5-Lift  2.12     2.15    2.01    2.14    2.23

Table 1: Evaluation effects of different models on the same test set
As can be seen from table 1, on the same test set the M+ model trained by the present invention has better sorting capability, distinguishing capability and local lift, so the determined target security level is more accurate. This strengthens the protection of data security in the internet service platform and effectively avoids the data leakage that occurs in the related art when the determined target security level is not accurate enough.
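The KS and Top5-Lift indices of table 1 can be computed as sketched below: KS is the maximum gap between the cumulative positive and negative rates when samples are sorted by score, and Top5-Lift compares the positive rate in the top 5% of scores with the overall positive rate. The function names are hypothetical and the 5% fraction is parameterized:

```python
def ks_statistic(labels, scores):
    # KS: sort by descending score and track the maximum |TPR - FPR| gap.
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    best = 0.0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        best = max(best, abs(tp / n_pos - fp / n_neg))
    return best

def top_lift(labels, scores, frac=0.05):
    # Lift in the top `frac` of scores: positive rate among the highest-
    # scored samples divided by the overall positive rate.
    k = max(1, int(len(labels) * frac))
    top = sorted(zip(scores, labels), reverse=True)[:k]
    top_rate = sum(label for _, label in top) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate
```

A Top5-Lift of 2.23, as reported for M+, would mean the riskiest 5% of targets by score contain about 2.23 times the base rate of true positives.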
Fig. 4 is a target security level recognition apparatus based on a fusion model according to the present invention, as shown in fig. 4, the apparatus includes:
the preprocessing module 41 is configured to preprocess the historical target data to obtain a training set;
a dividing module 42 for dividing the training set into a plurality of groups of training subsets;
a training module 43, configured to train, for each training subset, a corresponding fusion model using a plurality of different models and fusion modes;
the input module 44 is configured to input target data to be recognized into each trained fusion model, respectively, to obtain multiple initial security levels of the target to be recognized;
and the fusion module 45 is configured to perform fusion calculation on the multiple initial security levels of the target to be recognized to obtain the security level of the target to be recognized.
In a specific embodiment, the training module 43 is configured to perform training by using different types of models and/or models with the same type and different parameters and a fusion mode for each training subset; wherein, the models and the fusion modes adopted by the training subsets divided by the same training set are different.
Optionally, the training module 43 includes:
a first partitioning module for partitioning the training subset into a training sub-data set and a validation sub-data set;
the first training module is used for training a plurality of models of different categories through the training subdata set and the verification subdata set to respectively obtain a plurality of groups of sub-fusion models;
the first fusion module is used for carrying out model fusion on the plurality of groups of sub-fusion models to obtain corresponding fusion models;
and/or, the training module 43 comprises:
a generation module for generating a first model of a plurality of different parameters;
the second training module is used for inputting the training subsets into a plurality of first models respectively for training to obtain a plurality of groups of sub-fusion models and output results output by each sub-fusion model;
the second fusion module is used for carrying out mean value fusion on the plurality of groups of sub-fusion models and the corresponding output results to obtain corresponding fusion models;
and/or, the training module 43 includes:
a third dividing module, configured to divide the training subset into N sub-data sets;
the third training module is used for inputting the N sub-data sets into N second models with different parameters respectively for training to obtain N groups of sub-fusion models and N output results output by each sub-fusion model;
and the third fusion module is used for carrying out average fusion on the N groups of sub-fusion models and the corresponding N output results to obtain corresponding fusion models.
Further, the apparatus further comprises:
the determining module is used for determining the correlation coefficient of each model;
and the screening module is used for taking the models with the correlation coefficients larger than the threshold value as models of different categories and/or models of the same category and different parameters.
The dividing module 42 includes:
the balancing processing module is used for carrying out balancing processing on the training set;
and the sampling module is used for extracting samples from the training set after the balancing processing for multiple times to respectively form a training subset.
The preprocessing module 41 includes:
the cleaning module is used for cleaning the historical target data to obtain a wide-table variable;
and the analysis screening module is used for performing descriptive exploration analysis on the wide-table variables, screening out characteristic data according to the analysis result and obtaining a training set.
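As one concrete reading of the cleaning-then-screening pipeline, wide-table variables might be screened by simple descriptive statistics such as missing rate. This criterion and threshold are assumptions for illustration only (the patent does not fix the screening rule), and the names are hypothetical:

```python
def screen_features(rows, max_missing=0.5):
    # Keep a wide-table variable only if its missing rate (None values)
    # across all samples stays at or below the threshold.
    features = {key for row in rows for key in row}
    kept = []
    for feature in sorted(features):
        missing = sum(1 for row in rows if row.get(feature) is None)
        if missing / len(rows) <= max_missing:
            kept.append(feature)
    return kept
```

In practice the descriptive exploration analysis would combine several such statistics (missing rate, variance, value distributions) before the retained columns form the training set.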
Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as an implementation in physical form for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details not disclosed in the embodiments of the electronic device of the present invention, reference may be made to the above-described embodiments of the method or apparatus.
Fig. 5 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 5, the electronic device 500 of this exemplary embodiment takes the form of a general-purpose data processing device. The components of the electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one storage unit 520, a bus 530 connecting the different components of the electronic device (including the storage unit 520 and the processing unit 510), a display unit 540, and the like.
The storage unit 520 stores a computer-readable program, which may be the code of a source program or of an object program. The program may be executed by the processing unit 510 such that the processing unit 510 performs the steps of the various embodiments of the present invention. For example, the processing unit 510 may perform the steps shown in fig. 1.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203. The memory unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: operating the electronic device, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 100 (e.g., keyboard, display, network device, bluetooth device, etc.), enable a user to interact with the electronic device 500 via the external devices 100, and/or enable the electronic device 500 to communicate with one or more other data processing devices (e.g., router, modem, etc.). Such communication can occur via input/output (I/O) interfaces 550, and can also occur via network adapter 560 to one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet. The network adapter 560 may communicate with other modules of the electronic device 500 via the bus 530. It should be appreciated that although not shown in FIG. 5, other hardware and/or software modules may be used in the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID electronics, tape drives, and data backup storage electronics, among others.
FIG. 6 is a schematic diagram of one computer-readable medium embodiment of the present invention. As shown in fig. 6, the computer program may be stored on one or more computer-readable media. The computer-readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, implements the above-described method of the invention, namely: preprocessing historical target data to obtain a training set; dividing the training set into a plurality of groups of training subsets; for each training subset, training a corresponding fusion model using a plurality of different models and fusion modes; respectively inputting target data to be recognized into each trained fusion model to obtain a plurality of initial security levels of the target to be recognized; and performing fusion calculation on the plurality of initial security levels to obtain the security level of the target to be recognized.
Through the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described in the present invention may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a data processing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention.
The computer-readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
In summary, the present invention can be implemented as a method, apparatus, electronic device, or computer-readable medium that executes a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).
While the foregoing detailed description has described certain embodiments of the invention with reference to specific aspects and advantages thereof, it should be understood that the invention is not limited to any particular computer, virtual machine or electronic device, as various general-purpose machines may implement it. The invention is not to be considered limited to the specific embodiments described, but is to be understood to cover all modifications, changes and equivalents that come within its spirit and scope.

Claims (14)

1. A target security level identification method based on a fusion model is characterized by comprising the following steps:
preprocessing historical target data to obtain a training set;
dividing the training set into a plurality of sets of training subsets;
for each training subset, training a corresponding fusion model by adopting a plurality of different models and fusion modes;
respectively inputting target data to be recognized into each trained fusion model to obtain a plurality of initial security levels of the corresponding target to be recognized;
and performing fusion calculation on the plurality of initial security levels of the target to be recognized to obtain the security level of the target to be recognized.
2. The method of claim 1, wherein for each training subset, training a corresponding fusion model using a plurality of different models and fusion methods comprises:
aiming at each training subset, training by adopting models of different types and/or models of the same type and different parameters and a fusion mode; wherein, the models and the fusion modes adopted by the training subsets divided by the same training set are different.
3. The method of claim 1, wherein the training of the corresponding fusion model using the plurality of different models and fusion modes comprises:
dividing the training subset into a training sub data set and a verification sub data set;
training a plurality of models of different classes through the training subdata set and the verification subdata set to respectively obtain a plurality of groups of sub-fusion models;
carrying out model fusion on the multiple sub-fusion models to obtain corresponding fusion models;
and/or, the training of the corresponding fusion model by adopting a plurality of different models and fusion modes comprises:
generating a first model of a plurality of different parameters;
respectively inputting the training subsets into a plurality of first models for training to obtain a plurality of groups of sub-fusion models and output results output by each sub-fusion model;
carrying out mean value fusion on the multiple groups of sub-fusion models and the corresponding output results to obtain corresponding fusion models;
and/or, the training of the corresponding fusion model by adopting a plurality of different models and fusion modes comprises:
dividing the training subset into N sub-datasets;
inputting the N sub-data sets into N second models with different parameters respectively for training to obtain N groups of sub-fusion models and N output results output by each sub-fusion model;
and carrying out average fusion on the N groups of sub-fusion models and the corresponding N output results to obtain corresponding fusion models.
4. The method of claim 2, wherein prior to training the corresponding fusion model using the plurality of different models and fusion modes, the method further comprises:
determining a correlation coefficient of each model;
and taking the model with the correlation coefficient larger than the threshold value as the model of different classes and/or the model of different parameters of the same class.
5. The method of claim 1, wherein the dividing the training set into a plurality of groups of training subsets comprises:
carrying out balancing treatment on the training set;
and extracting samples from the training set after the balancing treatment for multiple times to respectively form a training subset.
6. The method of claim 1, wherein preprocessing the historical target data to obtain a training set comprises:
performing data cleaning on historical target data to obtain a wide-table variable;
and performing descriptive exploration analysis on the wide-table variables, and screening out characteristic data according to the analysis result to obtain a training set.
7. An apparatus for identifying a target security level based on a fusion model, the apparatus comprising:
the preprocessing module is used for preprocessing historical target data to obtain a training set;
a partitioning module for partitioning the training set into a plurality of groups of training subsets;
the training module is used for training the corresponding fusion model by adopting a plurality of different models and fusion modes for each training subset;
the input module is used for respectively inputting target data to be recognized into each trained fusion model to obtain a plurality of initial security levels of the target to be recognized;
and the fusion module is used for performing fusion calculation on the plurality of initial security levels of the target to be recognized to obtain the security level of the target to be recognized.
8. The device according to claim 7, wherein the training module is configured to perform training using different types of models and/or different types of models and fusion modes with the same type and different parameters for each training subset; wherein, the models and the fusion modes adopted by the training subsets divided by the same training set are different.
9. The apparatus of claim 7, wherein the training module comprises:
a first partitioning module, configured to partition the training subset into a training sub-data set and a validation sub-data set;
the first training module is used for training a plurality of models of different categories through the training subdata set and the verification subdata set to respectively obtain a plurality of groups of sub-fusion models;
the first fusion module is used for carrying out model fusion on the plurality of groups of sub-fusion models to obtain corresponding fusion models;
and/or, the training module comprises:
a generation module for generating a first model of a plurality of different parameters;
the second training module is used for inputting the training subsets into a plurality of first models respectively for training to obtain a plurality of groups of sub-fusion models and output results output by each sub-fusion model;
the second fusion module is used for carrying out mean value fusion on the plurality of groups of sub-fusion models and the corresponding output results to obtain corresponding fusion models;
and/or, the training module comprises:
a third dividing module, configured to divide the training subset into N sub-data sets;
the third training module is used for inputting the N sub-data sets into N second models with different parameters respectively for training to obtain N groups of sub-fusion models and N output results output by each sub-fusion model;
and the third fusion module is used for carrying out average fusion on the N groups of sub-fusion models and the corresponding N output results to obtain corresponding fusion models.
10. The apparatus of claim 8, further comprising:
the determining module is used for determining the correlation coefficient of each model;
and the screening module is used for taking the models with the correlation coefficients larger than the threshold value as models of different categories and/or models of the same category and different parameters.
11. The apparatus of claim 7, wherein the partitioning module comprises:
the balancing processing module is used for carrying out balancing processing on the training set;
and the sampling module is used for extracting samples from the training set after the balancing treatment for multiple times to respectively form a training subset.
12. The apparatus of claim 7, wherein the pre-processing module comprises:
the cleaning module is used for cleaning data of the historical target data to obtain a wide-table variable;
and the analysis screening module is used for performing descriptive exploration analysis on the wide-table variables, screening out characteristic data according to the analysis result and obtaining a training set.
13. An electronic device, comprising:
a processor; and
a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-6.
14. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.
CN202210786181.XA 2022-07-04 2022-07-04 Target security level identification method and device based on fusion model and electronic equipment Pending CN115310091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210786181.XA CN115310091A (en) 2022-07-04 2022-07-04 Target security level identification method and device based on fusion model and electronic equipment

Publications (1)

Publication Number Publication Date
CN115310091A true CN115310091A (en) 2022-11-08


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117040942A (en) * 2023-10-10 2023-11-10 深圳创拓佳科技有限公司 Network security test evaluation method and system based on deep learning
CN117040942B (en) * 2023-10-10 2024-02-27 深圳创拓佳科技有限公司 Network security test evaluation method and system based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 1109, No. 4, Lane 800, Tongpu Road, Putuo District, Shanghai, 200062
Applicant after: Shanghai Qiyue Information Technology Co.,Ltd.
Country or region after: China
Address before: Room a2-8914, 58 Fumin Branch Road, Hengsha Township, Chongming District, Shanghai, 201500
Applicant before: Shanghai Qiyue Information Technology Co.,Ltd.
Country or region before: China