CN113298185B - Model training method, abnormal file detection method, device, equipment and medium


Info

Publication number
CN113298185B
CN113298185B (application CN202110688044.8A)
Authority
CN
China
Prior art keywords
training
sample
model
samples
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110688044.8A
Other languages
Chinese (zh)
Other versions
CN113298185A (en)
Inventor
郭开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202110688044.8A priority Critical patent/CN113298185B/en
Publication of CN113298185A publication Critical patent/CN113298185A/en
Application granted granted Critical
Publication of CN113298185B publication Critical patent/CN113298185B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a model training method, a model training device, an abnormal file detection method, an electronic device, and a computer readable storage medium. The model training method comprises the following steps: selecting a plurality of pre-training samples from an original sample set, and training an initial model with the pre-training samples to obtain a pre-training model; inputting each non-pre-training sample in the original sample set into the pre-training model to obtain a classification result and a characteristic value; performing sample reduction processing on the non-pre-training samples based on the classification result and the characteristic value to obtain a reduced sample set; and training the pre-training model with the reduced sample set to obtain a trained model. By pre-classifying the samples with the pre-training model and then performing sample reduction based on the pre-classification result, the method reduces the sample set efficiently, shortens the time required for sample reduction, and thereby improves model training efficiency.

Description

Model training method, abnormal file detection method, device, equipment and medium
Technical Field
The present application relates to the field of network model technology, and in particular, to a model training method, an abnormal file detection method, a model training apparatus, an electronic device, and a computer readable storage medium.
Background
When a network model is used, its performance depends not only on the design of the algorithm but also on the quality of the training set used to train it. Because a larger data set lengthens training time, the related art generally reduces the sample set through manual verification in order to train a network model quickly. However, manual verification requires samples to be screened and labeled by hand, which takes a long time and is inefficient. The related art therefore cannot train models efficiently.
Disclosure of Invention
Accordingly, the present application provides a model training method, an abnormal file detection method, a model training apparatus, an electronic device, and a computer readable storage medium, which pre-classify samples with a pre-training model, divide a plurality of characteristic value intervals, and perform sample reduction processing within each interval, thereby reducing the time required for sample reduction, improving its efficiency, and in turn improving model training efficiency.
In order to solve the technical problems, the application provides a model training method, which specifically comprises the following steps:
selecting a plurality of pre-training samples from an original sample set, and training an initial model by using the pre-training samples to obtain a pre-training model;
inputting each non-pre-training sample in the original sample set into the pre-training model to obtain a classification result and a characteristic value;
sample simplifying processing is carried out on the non-pre-training samples based on the classification result and the characteristic value, so that a simplified sample set is obtained;
And training the pre-training model by utilizing the simplified sample set to obtain a trained model.
Optionally, the performing sample reduction processing on the non-pretrained sample based on the classification result and the feature value to obtain a reduced sample set includes:
dividing a plurality of characteristic value intervals based on the characteristic values;
And according to the classification result, sample simplifying processing is carried out on the non-pre-training samples in each characteristic value interval, so as to obtain the simplified sample set.
Optionally, the dividing the plurality of eigenvalue intervals based on the eigenvalues includes:
dividing a plurality of first intervals based on the characteristic values corresponding to the correct samples; the correct sample is a non-pre-training sample with correct classification results;
correspondingly, according to the classification result, performing sample reduction processing on the non-pre-training samples in each characteristic value interval to obtain the reduced sample set, including:
sample simplifying processing is carried out on the correct samples corresponding to each first interval, so that first simplified samples are obtained;
the reduced sample set is constructed based on the first reduced sample.
Optionally, the performing sample reduction processing on the correct samples corresponding to each first interval to obtain first reduced samples includes:
calculating the similarity between any two correct samples in each first interval, and determining the correct samples with the similarity larger than a similarity threshold as correct similar samples in the first interval;
and if the number of the correct similar samples is greater than a number threshold, deleting the correct similar samples to obtain the first reduced samples.
Optionally, the dividing the plurality of eigenvalue intervals based on the eigenvalues includes:
Dividing a plurality of second intervals based on the characteristic values corresponding to the error samples; the error sample is a non-pre-training sample with wrong classification result;
correspondingly, according to the classification result, performing sample reduction processing on the non-pre-training samples in each characteristic value interval to obtain the reduced sample set, including:
calculating the similarity between any two error samples in each second interval, and determining the error samples with the similarity larger than a similarity threshold as error similar samples in the second interval;
If the sample labels corresponding to the error similar samples have conflicts, deleting the error similar samples in the second interval to obtain a second simplified sample;
the reduced sample set is constructed based on the second reduced samples.
Optionally, the training the pre-training model by using the reduced sample set to obtain a trained model includes:
Performing weight increasing processing on the second reduced sample in the reduced sample set to obtain a weighted sample set;
And training the pre-training model by using the weighted sample set to obtain a trained model.
Optionally, the inputting each non-pre-training sample in the original sample set into the pre-training model to obtain a classification result and a feature value includes:
inputting each non-pre-training sample into the pre-training model to obtain the characteristic value and the recognition result;
if the recognition result matches the sample label corresponding to the non-pre-training sample, determining that the classification result is correct;
and if the recognition result does not match the sample label, determining that the classification result is wrong.
The application also provides an abnormal file detection method, which comprises the following steps:
acquiring a file to be tested;
inputting the file to be detected into an abnormal file detection model to obtain a corresponding detection result; the abnormal file detection model is obtained through training based on the model training method.
The application also provides a model training device, which comprises:
the pre-training module is used for selecting a plurality of pre-training samples from the original sample set, and training an initial model by utilizing the pre-training samples to obtain a pre-training model;
The sample flyback module is used for inputting each non-pre-training sample in the original sample set into the pre-training model to obtain a classification result and a characteristic value;
The sample simplifying module is used for carrying out sample simplifying processing on the non-pre-training samples based on the classification result and the characteristic value to obtain a simplified sample set;
And the retraining module is used for training the pre-training model by utilizing the simplified sample set to obtain a trained model.
The application also provides an electronic device comprising a memory and a processor, wherein:
The memory is used for storing a computer program;
The processor is configured to execute the computer program to implement the model training method and/or the abnormal file detection method.
The application also provides a computer readable storage medium for storing a computer program, wherein the computer program realizes the model training method and/or the abnormal file detection method when being executed by a processor.
According to the model training method provided by the application, a plurality of pre-training samples are selected from an original sample set, and an initial model is trained by utilizing the pre-training samples to obtain a pre-training model; inputting each non-pre-training sample in the original sample set into a pre-training model to obtain a classification result and a characteristic value; sample simplifying processing is carried out on the non-pre-training samples based on the classification result and the characteristic value, so that a simplified sample set is obtained; training the pre-training model by using the simplified sample set to obtain a trained model.
It can be seen that the method reduces the original sample set through pre-training and pre-classification. Specifically, the pre-training model is obtained by selecting pre-training samples and training on them, so the pre-training model can recognize part of the samples in the original sample set. By inputting each non-pre-training sample into the pre-training model, a corresponding classification result and characteristic value are obtained: the characteristic value represents the features of the non-pre-training sample, and the classification result indicates whether the pre-training model can recognize it, a correctly identified sample showing that the model already has some recognition capability for samples of that kind. Because the characteristic value is related to the features and class of a sample, and the classification result characterizes whether its class was recognized, the characteristic values and classification results together reveal which samples the pre-training model can already recognize and which it cannot, so that sample reduction can be performed efficiently, that is, sample reduction processing is performed on the non-pre-training samples to obtain a reduced sample set. Since the pre-training model has only undergone preliminary training, it is further trained with the reduced sample set in order to obtain an accurate trained model. The resulting trained model can recognize various kinds of samples and has better performance. Pre-classifying the samples with the pre-training model and then performing sample reduction based on the pre-classification result allows the samples to be reduced efficiently, quickly decreases the number of samples used for model training, improves model training efficiency, and solves the problem in the related art that models cannot be trained quickly and efficiently.
In addition, the application also provides a model training apparatus, an abnormal file detection method, an electronic device, and a computer readable storage medium, which have the same beneficial effects as the above method.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a model training method according to an embodiment of the present application;
FIG. 2 is a flowchart of a sample reduction process according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a model training device according to an embodiment of the present application;
fig. 4 is a schematic diagram of a hardware composition framework to which a model training method according to an embodiment of the present application is applicable.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of a model training method according to an embodiment of the present application. The method comprises the following steps:
S101: and selecting a plurality of pre-training samples from the original sample set, and training an initial model by utilizing the pre-training samples to obtain a pre-training model.
In this embodiment, the original sample set is the sample set used to train the initial model and includes a plurality of samples. Depending on the purpose of the trained model, the samples in the original sample set may differ; in practical applications they may be selected as needed and may be, for example, audio samples, image samples, file samples, or data samples, which this embodiment does not limit. Accordingly, the trained model may provide classification, recognition, feature extraction, or other functions, and this embodiment likewise does not limit the specific type or architecture of the initial model before training.
To improve model training efficiency and avoid the overlong training time caused by training on a massive number of samples, part of the training samples can be selected from the original sample set as pre-training samples. The pre-training samples are a subset of the original sample set, and the selection method is not limited; for example, the number of pre-training samples may be determined according to the number of training samples in the original sample set, and that number of training samples may then be randomly drawn from the original sample set as the pre-training samples. The pre-training samples are used to train the initial model, that is, an untrained model, which is trained until it converges. After training on the pre-training samples, the initial model becomes a pre-training model; because the pre-training samples are part of the original sample set, the pre-training model can recognize training samples of some of the categories in the original sample set. The specific training procedure is not limited and may follow the related art.
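A minimal Python sketch of this step is given below. It is only illustrative: the `fit` interface of the initial model, the 20% sampling ratio, and the (features, label) layout of each sample are assumptions introduced here, not requirements of the method.

```python
import random

def select_and_pretrain(original_samples, initial_model, pretrain_ratio=0.2, seed=0):
    """Pick a random subset as pre-training samples and train the initial model on it.

    `original_samples` is assumed to be a list of (features, label) pairs and
    `initial_model.fit(samples)` a hypothetical routine that trains to convergence
    and returns the trained (pre-training) model.
    """
    rng = random.Random(seed)
    n_pretrain = max(1, int(len(original_samples) * pretrain_ratio))
    chosen = set(rng.sample(range(len(original_samples)), n_pretrain))
    pretrain_samples = [s for i, s in enumerate(original_samples) if i in chosen]
    non_pretrain_samples = [s for i, s in enumerate(original_samples) if i not in chosen]
    pretrained_model = initial_model.fit(pretrain_samples)
    return pretrained_model, pretrain_samples, non_pretrain_samples
```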
S102: and inputting each non-pre-training sample in the original sample set into a pre-training model to obtain a classification result and a characteristic value.
The training samples in the original sample set that are not selected as pre-training samples are non-pre-training samples. Besides the pre-training samples, the original sample set contains a large number of such samples, and because the pre-training model can recognize the pre-training samples, many non-pre-training samples similar to the pre-training samples may also exist in the original sample set. To determine which non-pre-training samples the pre-training model can recognize, each non-pre-training sample can be input into the pre-training model in turn, and the pre-training model processes it to obtain the corresponding classification result and characteristic value. The characteristic value corresponds to the features of the training sample: two training samples with similar characteristic values very probably have similar features, and two similar samples necessarily have similar characteristic values. The classification result indicates whether the pre-training model can recognize the non-pre-training sample, and its specific form is not limited; for example, it can be correct or wrong, where correct means the pre-training model can recognize the sample, and wrong means the non-pre-training sample cannot be recognized correctly.
S103: and carrying out sample simplifying treatment on the non-pre-training samples based on the classification result and the characteristic value to obtain a simplified sample set.
Sample reduction processing refers to deleting similar samples among the non-pre-training samples, or reducing their number, so as to cut down the samples of the same type. Because the pre-training model has been trained to some extent, it inevitably has a certain recognition capability for certain types of samples even though its training is not complete. If a large number of samples of such a type continue to be used for training the pre-training model, the gain in model performance is small while considerable computing resources and time are wasted, so model training efficiency is low. In another case, if samples used to train the pre-training model are labeled inaccurately, the model cannot achieve stronger performance even after a large amount of training, and continued training consumes yet more time and computing resources. Therefore, to improve model training efficiency, after the classification result and characteristic value of each non-pre-training sample are obtained, sample reduction can be performed on the non-pre-training samples based on them, reducing the number of samples while keeping the valuable ones to form the reduced sample set.
It will be appreciated that the sample reduction process is based on the classification result and the characteristic value, so the classification result and/or the characteristic value must serve as the criterion during reduction. For example, in one embodiment the non-pre-training samples may be divided into two classes according to the classification result and sample reduction performed within each class separately. In another embodiment, the sample reduction process may include the following steps:
step 11: the plurality of eigenvalue intervals are divided based on the eigenvalues.
Step 12: and according to the classification result, sample simplifying processing is carried out on the non-pre-training samples in each characteristic value interval, so as to obtain a simplified sample set.
This embodiment further improves the efficiency of the sample reduction process. Multiple characteristic value intervals may be divided based on the characteristic values. One purpose of sample reduction is to reduce the number of samples of the types the model can already recognize (which may be called correct samples, i.e. non-pre-training samples whose classification result is correct), thereby shortening training time by cutting the number of such samples. In addition, the method may also screen out inaccurately labeled samples, so that removing them reduces interference with model training and further shortens the training time.
Therefore, sample reduction requires judging whether two correct samples are sufficiently similar. The characteristic value characterizes the features of a sample; it is derived from the sample's original features and is unaffected by its label, so the more similar two characteristic values are, the more likely the two samples are similar, and the more dissimilar the characteristic values, the more likely the samples differ. With intervals divided over the characteristic values, different intervals cover different value ranges, and samples within the same interval are more likely to be similar. After the characteristic value intervals are divided, the non-pre-training samples in each interval can be reduced in different ways according to whether their classification results are correct or wrong, finally yielding the reduced sample set. Dividing the characteristic value intervals reduces the number of times two samples must be compared for sufficient similarity, which lowers the consumption of computing resources and improves model training efficiency when training on a large-scale data set.
This embodiment does not limit the specific way the characteristic value intervals are divided. For example, in one embodiment a plurality of equal-length characteristic value intervals may be divided according to a preset length; in another embodiment, the intervals may be divided according to the distribution of the characteristic values so that each interval contains the same or a similar number of samples. Both options are sketched below.
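The two division options mentioned above can be sketched as follows; the bucket layout and the helper names are illustrative assumptions rather than part of the patent.

```python
def split_equal_length(feature_values, interval_length=0.1):
    """Group sample indices into equal-length characteristic value intervals."""
    buckets = {}
    for idx, value in enumerate(feature_values):
        key = int(value // interval_length)   # index of the interval the value falls in
        buckets.setdefault(key, []).append(idx)
    return buckets

def split_equal_count(feature_values, n_intervals=10):
    """Group sample indices so each interval holds roughly the same number of samples."""
    order = sorted(range(len(feature_values)), key=lambda i: feature_values[i])
    size = max(1, len(order) // n_intervals)
    n_buckets = (len(order) + size - 1) // size
    return {k: order[k * size:(k + 1) * size] for k in range(n_buckets)}
```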
S104: training the pre-training model by using the simplified sample set to obtain a trained model.
The reduced sample set obtained after sample reduction contains fewer samples, so retraining the pre-training model with it takes less time and the trained model is produced more efficiently.
By applying the model training method provided by this embodiment of the application, the pre-training model is obtained by selecting pre-training samples and training on them, so the pre-training model can recognize part of the samples in the original sample set. By inputting each non-pre-training sample into the pre-training model, a corresponding classification result and characteristic value are obtained: the characteristic value represents the features of the non-pre-training sample, and the classification result indicates whether the pre-training model can recognize it, a correctly identified sample showing that the model already has some recognition capability for samples of that kind. Because the characteristic value is related to the features and class of a sample, and the classification result characterizes whether its class was recognized, the characteristic values and classification results together reveal which samples the pre-training model can already recognize and which it cannot, so that sample reduction can be performed efficiently, that is, sample reduction processing is performed on the non-pre-training samples to obtain a reduced sample set. Since the pre-training model has only undergone preliminary training, it is further trained with the reduced sample set in order to obtain an accurate trained model. The resulting trained model can recognize various kinds of samples and has better performance. Pre-classifying the samples with the pre-training model and then performing sample reduction based on the pre-classification result allows the samples to be reduced efficiently, quickly decreases the number of samples used for model training, improves model training efficiency, and solves the problem in the related art that models cannot be trained quickly and efficiently.
Based on the above embodiments, the present embodiment will specifically explain several steps in the above embodiments. In one embodiment, when training the pre-training model, a classification model can be trained simultaneously, so that a corresponding classification result can be generated by using the classification model later; in another embodiment, to improve accuracy of the classification result, the process of inputting each non-pre-training sample in the original sample set into the pre-training model to obtain the classification result and the feature value may include the following steps:
Step 21: and inputting each non-preset training sample into the pre-training model to obtain a characteristic value and a recognition result.
In this embodiment, after the non-preset training sample is input into the pre-training model, the pre-training model may extract and output the corresponding feature value thereof, and identify the non-pre-training sample according to the pre-training condition, so as to obtain the corresponding identification result. Because the recognition capability of the pre-training model is limited, the obtained recognition result is not necessarily accurate, and the situation that the recognition result is not matched with a sample label may occur.
Step 22: if the identification result is matched with the sample label corresponding to the non-preset training sample, the classification result is determined to be correct.
If the recognition result is matched with the sample label, the pre-training model can be determined to have the recognition capability on the non-preset training sample, so that the classification result can be determined to be correct.
Step 23: if the identification result is not matched with the sample label, determining that the classification result is wrong.
If the recognition result is not matched with the sample label, the recognition error of the pre-training model to the non-pre-training sample is indicated, so that the classification result is determined to be wrong, the pre-training model does not have the recognition capability to the sample, and the pre-training model needs to be trained by using the sample later. By using the method, accurate classification results can be obtained, so that a trained model with better performance can be obtained after the pre-trained model is trained again later.
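One way steps 21 to 23 could look in code is sketched below; the `predict` interface returning a predicted label together with a score used as the characteristic value is an assumption for illustration only.

```python
def pre_classify(pretrained_model, non_pretrain_samples):
    """Attach a characteristic value and a correct/wrong classification result to each sample."""
    classified = []
    for features, label in non_pretrain_samples:
        # Assumed interface: the model returns (predicted_label, score); the score
        # plays the role of the sample's characteristic value.
        predicted_label, feature_value = pretrained_model.predict(features)
        classified.append({
            "features": features,
            "label": label,
            "feature_value": feature_value,
            "result": "correct" if predicted_label == label else "wrong",
        })
    return classified
```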
Based on the above embodiment, for the non-pre-training sample with correct classification result, i.e. the correct sample, since the pre-training model already has the ability to identify the sample of this type, a large number of samples of the same type need not be used for training in the subsequent training, and the number thereof can be reduced. In this case, the process of dividing the plurality of eigenvalue intervals based on the eigenvalues may include the steps of:
step 31: and dividing a plurality of first intervals based on the characteristic values corresponding to the correct samples.
The correct sample is a non-pre-training sample with correct classification result. In order to improve the effectiveness of sample simplification, in the process of simplifying the correct sample, when dividing the characteristic value interval, the corresponding characteristic value interval is divided only based on the characteristic value corresponding to the correct sample, namely the first interval.
Correspondingly, according to the classification result, sample simplifying processing is carried out on the non-pre-training samples in each characteristic value interval, and the process of obtaining the simplified sample set can comprise the following steps:
Step 32: and carrying out sample simplifying processing on the correct samples corresponding to each first interval to obtain first simplified samples.
Step 33: a reduced sample set is constructed based on the first reduced sample.
After the first intervals are obtained by dividing, sample simplification is carried out on correct samples in each first interval respectively so as to reduce the number of the correct samples, further obtain first simplified samples, and a simplified sample set is formed by using the first simplified samples. The embodiment is not limited to a specific sample compacting manner, for example, in an implementation manner, a process of performing sample compacting processing on correct samples corresponding to each first interval to obtain first compacted samples may include the following steps:
Step 41: respectively calculating the similarity between any two correct samples in each first interval, and determining the correct samples with the similarity larger than a similarity threshold value as the correct similar samples in the first interval;
step 42: if the number of the correct similar samples is larger than the number threshold, deleting the correct similar samples to obtain a first simplified sample.
The characteristic values reflect the features of the samples to a certain extent, so samples with similar characteristic values are similar to a certain extent, and the correct samples within one first interval can all be regarded as relatively similar. In this embodiment, however, they are not all treated as sufficiently similar: small differences may remain among them, and pruning samples that are not sufficiently similar would harm the training effect of the model. Therefore, to reduce samples accurately, the similarity between correct samples in each first interval can be calculated and compared against a similarity threshold, whose specific value is not limited. If the similarity is greater than the similarity threshold, the two correct samples used to calculate it are determined to be correct similar samples, i.e. correct samples that are sufficiently similar.
If the number of correct similar samples is greater than the number threshold, there are many sufficiently similar samples in the first interval and reduction can be performed. The correct similar samples are pruned, and the remaining correct samples are determined to be the first reduced samples. It is understood that pruning is carried out among the correct similar samples corresponding to each first interval. In this way, the similarity further confirms whether correct samples are similar enough, making sample reduction more accurate.
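A sketch of this pruning within one first interval is given below, under the assumption that each sample is a dict with a numeric 'features' vector and that cosine similarity stands in for the unspecified similarity measure; the thresholds and keep ratio are illustrative values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (illustrative similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def prune_correct_interval(correct_samples, sim_threshold=0.95,
                           count_threshold=100, keep_ratio=0.1):
    """Within one first interval, prune correct similar samples when there are too many."""
    similar_idx = set()
    for i in range(len(correct_samples)):
        for j in range(i + 1, len(correct_samples)):
            if cosine_similarity(correct_samples[i]["features"],
                                 correct_samples[j]["features"]) > sim_threshold:
                similar_idx.update((i, j))
    if len(similar_idx) <= count_threshold:
        return correct_samples                        # not enough similar samples to prune
    keep = set(sorted(similar_idx)[:int(len(similar_idx) * keep_ratio)])
    return [s for i, s in enumerate(correct_samples)  # the first reduced samples
            if i not in similar_idx or i in keep]
```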
Based on the above embodiment, in a possible implementation manner, in the case that there may be a marking error in a non-pre-training sample in the original sample set, in order to reduce interference in the model training process and improve training efficiency, the process of dividing a plurality of eigenvalue intervals based on eigenvalues may include the following steps:
Step 51: and dividing a plurality of second intervals based on the characteristic values corresponding to the error samples.
The erroneous samples are non-pre-trained samples with erroneous classification results. Note that, in the present embodiment, the division manner of the second section may be the same as or different from the division manner of the first section, and the specific division manner is not limited.
Correspondingly, according to the classification result, sample simplifying processing is carried out on the non-pre-training samples in each characteristic value interval, and the process of obtaining the simplified sample set can comprise the following steps:
Step 52: and respectively calculating the similarity between any two error samples in each second interval, and determining the error samples with the similarity larger than the similarity threshold value as error similar samples in the second interval.
Step 53: if the sample labels corresponding to the error similar samples have conflicts, deleting the error similar samples in the second interval to obtain a second simplified sample.
Step 54: a reduced sample set is constructed based on the second reduced sample. Similar to the division manner, in this embodiment, the similarity calculation manner between two error samples may be the same or different from the similarity calculation manner between the two correct samples, and the similarity may represent the similarity between the two samples.
After the error similar samples are determined, it is checked whether their sample labels conflict. A sample label conflict means that the labels of the error similar samples are not all identical. If there is a conflict, a labeling error exists and none of that cluster of error similar samples can be trusted, so the error similar samples are deleted. The samples remaining after the error similar samples are deleted are the second reduced samples, which are used to form the reduced sample set.
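The corresponding treatment of one second interval might look like the following sketch; it reuses the illustrative cosine_similarity helper above and simplifies "error similar samples" to a single pairwise-threshold cluster.

```python
def reduce_error_interval(error_samples, sim_threshold=0.95):
    """Within one second interval, drop error similar samples whose labels conflict."""
    similar_idx = set()
    for i in range(len(error_samples)):
        for j in range(i + 1, len(error_samples)):
            if cosine_similarity(error_samples[i]["features"],
                                 error_samples[j]["features"]) > sim_threshold:
                similar_idx.update((i, j))
    labels = {error_samples[i]["label"] for i in similar_idx}
    if len(labels) > 1:                               # conflicting labels: cluster is untrusted
        return [s for i, s in enumerate(error_samples) if i not in similar_idx]
    return error_samples                              # consistent labels: keep all error samples
```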
Further, in order to further improve the training efficiency of the model, to make the model quickly have the ability of identifying the error sample, the training of the pre-training model by using the simplified sample set may include:
step 61: and performing weight increasing processing on the second reduced sample in the reduced sample set to obtain a weighted sample set.
Because the pre-training model cannot correctly recognize the second reduced samples, their weight can be increased in order to enhance the pre-training model's recognition capability for them, yielding a weighted sample set. The specific way the weight of the second reduced samples is increased is not limited, and reference may be made to the related art.
Step 62: and training the pre-training model by using the weighted sample set to obtain a trained model.
By weighting the second reduced samples, training the pre-training model with the weighted sample set pays more attention to learning them, improving the model's ability to recognize the second reduced samples and thus the performance of the trained model.
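A sketch of steps 61 and 62 follows, assuming each retained sample dict carries an 'is_error' flag and that the model's hypothetical fit routine accepts per-sample weights; the weight factor is an arbitrary illustrative value.

```python
def retrain_with_weights(pretrained_model, reduced_samples, error_weight=2.0):
    """Increase the weight of the second reduced (error) samples and retrain."""
    weights = [error_weight if s.get("is_error") else 1.0 for s in reduced_samples]
    return pretrained_model.fit(reduced_samples, sample_weight=weights)
```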
Based on the above embodiments, after the trained model is obtained, further processing may be performed, for example, in one embodiment, the trained model may be sent to other electronic devices, so that the other electronic devices may complete tasks corresponding to the model functions using the trained model. Specifically, the method can further comprise the following steps:
Step 71: and acquiring target equipment information, and sending the trained model to target equipment corresponding to the target equipment information.
The specific form of the target device information is not limited, and may be, for example, a device network address or a device number.
In another embodiment, the trained model may be applied to detect an abnormal file, such as a virus file, so the embodiment further provides an abnormal file detection method, which specifically includes:
step 81: and obtaining the file to be tested.
Step 82: and inputting the file to be detected into an abnormal file detection model to obtain a corresponding detection result.
The abnormal file detection model is obtained through training based on the model training method.
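Applying the trained detector to a file to be tested could then look like the sketch below; the feature extraction step and the model interface mirror the assumptions made in the training sketches and are not prescribed by the patent.

```python
def detect_file(detector_model, file_bytes, extract_features):
    """Classify one file with the trained abnormal file detection model (illustrative)."""
    features = extract_features(file_bytes)           # e.g. static features of the file
    predicted_label, score = detector_model.predict(features)
    return predicted_label, score
```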
Referring to fig. 2, fig. 2 is a flowchart of a specific sample reduction process according to an embodiment of the present application. The original data set is the original sample set; it is split in a given proportion to obtain a pre-training data set containing at least two pre-training samples. Training a machine learning model (i.e. the initial model) with the model training method yields a converged basic model, i.e. the pre-training model, in a shorter time. The part of the original data set not assigned to the pre-training data set is input into the pre-training model to obtain a corresponding prediction result and a prediction score (i.e. a characteristic value). The samples are classified according to the prediction results into correct samples (correctly predicted samples) and error samples (incorrectly predicted samples). First intervals and second intervals are then demarcated over the prediction scores, the similarity between samples within each score interval is calculated, and error similar samples and correct similar samples are determined. For error samples, if the labels of the error similar samples conflict, that cluster of error similar samples is deleted from the training set; if the sample labels are consistent, the weight of those error samples in the training set is increased. For correctly predicted samples, correct similar samples are likewise found within each interval; if their number exceeds the threshold, the cluster contains too many samples and can be pruned in a certain proportion. A reduced data set is finally obtained, and the pre-training model is trained with it to obtain the trained model.
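Tying the illustrative helpers above together, the overall flow of fig. 2 could be sketched as follows; the interval count, thresholds, and weighting are assumptions, since the patent leaves the concrete choices open.

```python
def train_with_sample_reduction(initial_model, original_samples, n_intervals=10):
    """End-to-end sketch: pre-train, pre-classify, reduce per interval, retrain."""
    model, _, non_pretrain = select_and_pretrain(original_samples, initial_model)
    classified = pre_classify(model, non_pretrain)

    correct = [s for s in classified if s["result"] == "correct"]
    wrong = [s for s in classified if s["result"] == "wrong"]
    for s in wrong:
        s["is_error"] = True                          # so retraining can up-weight them

    reduced = []
    # Reduce correct samples within first intervals.
    for idx in split_equal_count([s["feature_value"] for s in correct], n_intervals).values():
        reduced += prune_correct_interval([correct[i] for i in idx])
    # Reduce error samples within second intervals.
    for idx in split_equal_count([s["feature_value"] for s in wrong], n_intervals).values():
        reduced += reduce_error_interval([wrong[i] for i in idx])

    return retrain_with_weights(model, reduced)
```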
The following describes a model training apparatus provided in an embodiment of the present application, and the model training apparatus described below and the model training method described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a model training device according to an embodiment of the present application, including:
a pre-training module 110, configured to select a plurality of pre-training samples from an original sample set, and train an initial model by using the pre-training samples to obtain a pre-training model;
a sample retrace module 120, configured to input each non-pre-training sample in the original sample set into the pre-training model to obtain a classification result and a feature value;
The sample simplifying module 130 is configured to perform sample simplifying processing on the non-pre-training sample based on the classification result and the feature value, so as to obtain a simplified sample set;
And the retraining module 140 is configured to train the pre-trained model by using the reduced sample set, so as to obtain a trained model.
Optionally, the sample reduction module 130 includes:
A section dividing unit for dividing a plurality of characteristic value sections based on the characteristic values;
and the simplifying unit is used for carrying out sample simplifying processing on the non-pre-training samples in each characteristic value interval according to the classification result to obtain the simplified sample set.
Optionally, the section dividing unit includes:
a first dividing subunit, configured to divide a plurality of first intervals based on the feature values corresponding to the correct samples; the correct sample is a non-pre-training sample with correct classification results;
correspondingly, the simplifying unit comprises:
the first simplifying subunit is used for carrying out sample simplifying processing on the correct samples corresponding to each first interval to obtain first simplified samples;
A first construction subunit for constructing the reduced sample set based on the first reduced samples.
Optionally, the first compaction subunit includes:
A similarity calculating subunit, configured to calculate a similarity between any two of the correct samples in each first interval, and determine the correct samples with the similarity greater than a similarity threshold as correct similar samples in the first interval;
And the pruning processing subunit is used for pruning the correct similar samples to obtain the first pruned samples if the number of the correct similar samples is greater than a number threshold.
Optionally, the section dividing unit includes:
a second dividing subunit, configured to divide a plurality of second intervals based on the feature values corresponding to the error samples; the error sample is a non-pre-training sample with wrong classification result;
correspondingly, the simplifying unit comprises:
an error similar sample determining subunit, configured to calculate a similarity between any two error samples in each second interval, and determine, as error similar samples in the second interval, the error samples with the similarity greater than a similarity threshold;
a deleting subunit, configured to delete the error similar samples in the second interval if there is a conflict in the sample labels corresponding to the error similar samples, so as to obtain a second simplified sample;
a second construction subunit configured to construct the reduced sample set based on the second reduced sample.
Optionally, the retraining module 140 includes:
the weighting unit is used for carrying out weight increasing processing on the second reduced sample in the reduced sample set to obtain a weighted sample set;
and the training unit is used for training the pre-training model by using the weighted sample set to obtain a trained model.
Optionally, the sample retrace module 120 includes:
The input unit is used for inputting each non-preset training sample into the pre-training model to obtain the characteristic value and the recognition result;
The correct determining unit is used for determining that the classification result is correct if the identification result is matched with the sample label corresponding to the non-preset training sample;
and the error determining unit is used for determining that the classification result is wrong if the identification result is not matched with the sample label.
The abnormal file detection apparatus provided by the embodiment of the application is introduced below; the abnormal file detection apparatus described below and the abnormal file detection method described above may be referred to correspondingly.
The embodiment also provides an abnormal file detection device, which comprises:
The file to be tested obtaining module is used for obtaining the file to be tested;
The file detection module is used for inputting the file to be detected into the abnormal file detection model to obtain a corresponding detection result; the abnormal file detection model is obtained through training based on the model training method.
The electronic device provided by the embodiment of the application is introduced below, and the electronic device described below and the model training method described above can be referred to correspondingly.
Referring to fig. 4, fig. 4 is a schematic diagram of a hardware composition framework to which a model training method according to an embodiment of the present application is applicable. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
The processor 101 is configured to control the overall operation of the electronic device 100 to perform all or part of the steps in the model training method described above; the memory 102 is used to store various types of data to support operation at the electronic device 100, which may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The memory 102 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. In the present embodiment, the memory 102 stores at least programs and/or data for realizing the following functions:
selecting a plurality of pre-training samples from an original sample set, and training an initial model by using the pre-training samples to obtain a pre-training model;
inputting each non-pre-training sample in the original sample set into a pre-training model to obtain a classification result and a characteristic value;
Sample simplifying processing is carried out on the non-pre-training samples based on the classification result and the characteristic value, so that a simplified sample set is obtained;
Training the pre-training model by using the simplified sample set to obtain a trained model.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen; the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 102 or transmitted through the communication component 105. The audio component further comprises at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, a mouse, or buttons, which may be virtual or physical. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; the corresponding communication component 105 may thus comprise a Wi-Fi module, a Bluetooth module, and an NFC module.
The electronic device 100 may be implemented by one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components for performing the model training method as set forth in the above embodiments.
Of course, the structure of the electronic device 100 shown in fig. 4 is not limited to the electronic device in the embodiment of the present application, and the electronic device 100 may include more or less components than those shown in fig. 4 or may combine some components in practical applications.
The following describes a computer readable storage medium provided in an embodiment of the present application, where the computer readable storage medium described below and the model training method described above may be referred to correspondingly.
The application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the model training method when being executed by a processor.
The computer readable storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "include", "comprise", and any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present application and the core ideas thereof; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present application, the present description should not be construed as limiting the present application in view of the above.

Claims (10)

1. A method of model training, comprising:
selecting a plurality of pre-training file samples from an original sample set, and training an initial model by utilizing the pre-training file samples to obtain a pre-training model;
Inputting each non-pre-training file sample in the original sample set into the pre-training model to obtain a classification result and a characteristic value; the classification result is used for indicating whether the pre-training model has identification capability on the non-pre-training file sample, and the characteristic value corresponds to the characteristic of the non-pre-training file sample;
dividing a plurality of characteristic value intervals based on the characteristic values;
According to the classification result, sample simplifying processing is carried out on the non-pre-training file samples in each characteristic value interval to obtain a simplified sample set;
training the pre-training model by utilizing the simplified sample set to obtain an abnormal file detection model, wherein the abnormal file detection model is used for detecting whether a file is an abnormal file or not.
2. The model training method according to claim 1, wherein the dividing a plurality of feature value intervals based on the feature values comprises:
dividing a plurality of first intervals based on the feature values corresponding to correct samples, wherein a correct sample is a non-pre-training file sample whose classification result is correct;
correspondingly, the performing sample reduction processing on the non-pre-training file samples in each feature value interval according to the classification results to obtain a reduced sample set comprises:
performing sample reduction processing on the correct samples corresponding to each first interval to obtain first reduced samples; and
constructing the reduced sample set based on the first reduced samples.
3. The model training method according to claim 2, wherein the performing sample reduction processing on the correct samples corresponding to each first interval to obtain first reduced samples comprises:
calculating the similarity between any two correct samples in each first interval, and determining the correct samples whose similarity is greater than a similarity threshold as correct similar samples of the first interval; and
if the number of the correct similar samples is greater than a number threshold, deleting the correct similar samples to obtain the first reduced samples.
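The reduction in claims 2 and 3 can be sketched as follows, assuming cosine similarity over feature vectors and illustrative thresholds (the claims fix neither the similarity measure nor the threshold values, and the helper names are hypothetical). The sketch keeps one representative of each overly large group of near-duplicates, which is one possible reading of "deleting the correct similar samples".

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def reduce_correct_samples(X_correct, bin_ids, sim_threshold=0.95, count_threshold=20):
    """Within each first interval, collapse groups of near-duplicate correctly
    classified samples whenever a group exceeds the count threshold."""
    keep = np.ones(len(X_correct), dtype=bool)
    for b in np.unique(bin_ids):
        idx = np.where(bin_ids == b)[0]
        if len(idx) < 2:
            continue
        sim = cosine_similarity(X_correct[idx])     # pairwise similarity in this interval
        visited = np.zeros(len(idx), dtype=bool)
        for i in range(len(idx)):
            if visited[i]:
                continue
            group = np.where(sim[i] > sim_threshold)[0]
            group = group[group != i]
            if len(group) > count_threshold:
                keep[idx[group]] = False            # keep sample i, drop its near-duplicates
            visited[group] = True
            visited[i] = True
    return keep
```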
4. The model training method according to claim 1, wherein the dividing a plurality of feature value intervals based on the feature values comprises:
dividing a plurality of second intervals based on the feature values corresponding to error samples, wherein an error sample is a non-pre-training file sample whose classification result is wrong;
correspondingly, the performing sample reduction processing on the non-pre-training file samples in each feature value interval according to the classification results to obtain a reduced sample set comprises:
calculating the similarity between any two error samples in each second interval, and determining the error samples whose similarity is greater than a similarity threshold as error similar samples of the second interval;
if the sample labels corresponding to the error similar samples conflict, deleting those error similar samples from the second interval to obtain second reduced samples; and
constructing the reduced sample set based on the second reduced samples.
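A corresponding sketch for claim 4: the conflict test asks whether two highly similar misclassified samples carry different labels, which usually points to noisy or contradictory labelling. The similarity measure, threshold, and function names are again assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def reduce_error_samples(X_wrong, y_wrong, bin_ids, sim_threshold=0.95):
    """Within each second interval, drop near-identical misclassified samples
    whose labels conflict with each other."""
    keep = np.ones(len(X_wrong), dtype=bool)
    for b in np.unique(bin_ids):
        idx = np.where(bin_ids == b)[0]
        if len(idx) < 2:
            continue
        sim = cosine_similarity(X_wrong[idx])
        for i in range(len(idx)):
            for j in range(i + 1, len(idx)):
                if sim[i, j] > sim_threshold and y_wrong[idx[i]] != y_wrong[idx[j]]:
                    keep[idx[i]] = keep[idx[j]] = False   # conflicting labels: drop both
    # the surviving error samples are the "second reduced samples"
    return keep
```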
5. The model training method according to claim 4, wherein the training the pre-training model with the reduced sample set to obtain an abnormal file detection model comprises:
increasing the weights of the second reduced samples in the reduced sample set to obtain a weighted sample set; and
training the pre-training model with the weighted sample set to obtain the abnormal file detection model.
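Because most classifiers accept per-sample weights, the weighting of claim 5 can be expressed with scikit-learn's sample_weight argument; the factor of 2.0 applied to the second reduced samples is an arbitrary illustrative value, and the function name is hypothetical.

```python
import numpy as np

def retrain_with_weights(model, X_reduced, y_reduced, is_second_reduced, boost=2.0):
    """Give the previously misclassified (second reduced) samples a larger
    weight, then retrain the pre-trained model on the reduced sample set."""
    weights = np.where(is_second_reduced, boost, 1.0)     # illustrative boost factor
    model.fit(X_reduced, y_reduced, sample_weight=weights)
    return model
```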
6. The model training method according to claim 1, wherein the inputting each non-pre-training file sample in the original sample set into the pre-training model to obtain a classification result and a feature value comprises:
inputting each non-pre-training file sample into the pre-training model to obtain the feature value and an identification result;
if the identification result matches the sample label corresponding to the non-pre-training file sample, determining that the classification result is correct; and
if the identification result does not match the sample label, determining that the classification result is wrong.
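The check in claim 6 amounts to comparing the model's prediction with the stored sample label. A hypothetical helper, using the abnormal-class probability as the feature value, might look like this:

```python
import numpy as np

def pre_classify(model, X_rest, y_rest):
    """Return, for each non-pre-training sample, its feature value and whether
    the pre-trained model classified it correctly."""
    feature_values = model.predict_proba(X_rest)[:, 1]
    predictions = model.predict(X_rest)
    is_correct = predictions == np.asarray(y_rest)   # True -> "correct sample"
    return feature_values, is_correct
```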
7. An abnormal file detection method, comprising:
acquiring a file to be detected; and
inputting the file to be detected into an abnormal file detection model to obtain a corresponding detection result, wherein the abnormal file detection model is trained by the model training method according to any one of claims 1 to 6, and the detection result indicates whether the file to be detected is an abnormal file.
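At detection time, claim 7 reduces to a single forward pass through the trained model. The feature extractor and decision threshold below are placeholders, since the claim does not prescribe how file features are computed.

```python
import numpy as np

def detect_abnormal_file(model, path, extract_features, threshold=0.5):
    """Read a file, turn it into a feature vector with a caller-supplied
    extractor, and flag it as abnormal when the model's score exceeds the
    threshold. (extract_features and the threshold are assumptions.)"""
    with open(path, "rb") as f:
        raw = f.read()
    x = np.asarray(extract_features(raw)).reshape(1, -1)
    score = model.predict_proba(x)[0, 1]
    return score >= threshold, score
```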
8. A model training device, comprising:
a pre-training module, configured to select a plurality of pre-training file samples from an original sample set and train an initial model with the pre-training file samples to obtain a pre-training model;
a pre-classification module, configured to input each non-pre-training file sample in the original sample set into the pre-training model to obtain a classification result and a feature value, wherein the classification result indicates whether the pre-training model can correctly identify the non-pre-training file sample, and the feature value corresponds to a feature of the non-pre-training file sample;
a sample reduction module, configured to divide a plurality of feature value intervals based on the feature values and to perform sample reduction processing on the non-pre-training file samples in each feature value interval according to the classification results to obtain a reduced sample set; and
a retraining module, configured to train the pre-training model with the reduced sample set to obtain an abnormal file detection model, wherein the abnormal file detection model is used to detect whether a file is an abnormal file.
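The apparatus of claim 8 maps naturally onto a small class whose methods mirror the four modules. The following is an organisational sketch under the same assumptions as the earlier snippets (NumPy inputs, a scikit-learn-style classifier), not the patented implementation.

```python
import numpy as np

class ModelTrainingDevice:
    """Illustrative grouping of the four modules in claim 8."""

    def __init__(self, initial_model):
        self.model = initial_model

    def pre_train(self, X_pre, y_pre):                    # pre-training module
        self.model.fit(X_pre, y_pre)

    def pre_classify(self, X_rest, y_rest):               # pre-classification module
        feature_values = self.model.predict_proba(X_rest)[:, 1]
        correct = self.model.predict(X_rest) == y_rest
        return feature_values, correct

    def reduce(self, X_rest, y_rest, feature_values, correct):   # sample reduction module
        # stand-in rule: keep every misclassified sample and drop the "easy"
        # correct ones; the interval-based reduction of claims 2-5 would replace this
        margin = np.abs(feature_values - 0.5)              # distance from the decision boundary
        keep = (~correct) | (margin < 0.45)
        return X_rest[keep], y_rest[keep]

    def retrain(self, X_reduced, y_reduced):               # retraining module
        self.model.fit(X_reduced, y_reduced)
        return self.model
```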
9. An electronic device, comprising a memory and a processor, wherein:
the memory is configured to store a computer program; and
the processor is configured to execute the computer program to implement the model training method according to any one of claims 1 to 6 and/or the abnormal file detection method according to claim 7.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the model training method according to any one of claims 1 to 6 and/or the abnormal file detection method according to claim 7.
CN202110688044.8A 2021-06-21 2021-06-21 Model training method, abnormal file detection method, device, equipment and medium Active CN113298185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110688044.8A CN113298185B (en) 2021-06-21 2021-06-21 Model training method, abnormal file detection method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113298185A CN113298185A (en) 2021-08-24
CN113298185B (en) 2024-05-28

Family

ID=77328962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110688044.8A Active CN113298185B (en) 2021-06-21 2021-06-21 Model training method, abnormal file detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113298185B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115412346B (en) * 2022-08-30 2024-06-04 Chongqing Changan Automobile Co Ltd Message detection method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101206667A (en) * 2007-12-06 2008-06-25 Shanghai Jiao Tong University Method for reducing training time and supporting vector
CN104634451A (en) * 2015-02-11 2015-05-20 Wuhan University Spectrum reconstruction method and system based on multichannel imaging system
CN108665159A (en) * 2018-05-09 2018-10-16 OneConnect Smart Technology Co Ltd (Shenzhen) A kind of methods of risk assessment, device, terminal device and storage medium
CN112668718A (en) * 2021-01-19 2021-04-16 Beijing SenseTime Technology Development Co Ltd Neural network training method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009025045A1 (en) * 2007-08-22 2009-02-26 Fujitsu Limited Compound property prediction apparatus, property prediction method and program for executing the method

Also Published As

Publication number Publication date
CN113298185A (en) 2021-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant