CN113537262A - Data analysis method, device, equipment and readable storage medium - Google Patents

Publication number
CN113537262A
Authority
CN
China
Prior art keywords
sample
sample set
samples
analysis model
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010313979.3A
Other languages
Chinese (zh)
Inventor
刘彦南
郑坚秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN202010313979.3A
Publication of CN113537262A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Abstract

The invention discloses a data analysis method, which comprises the following steps: analyzing the types of a plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample, and obtaining the actual type of each first sample; determining an abnormal sample set according to the first samples whose prediction type differs from the actual type; obtaining samples that at least comprise the abnormal sample set as a first sample set; constructing a second analysis model by using the first sample set as training samples; and predicting the type of a sample to be tested according to the first analysis model and the second analysis model. The invention also discloses a data analysis device, data analysis equipment and a readable storage medium. The invention aims to process missed-report and false-report data automatically, thereby improving the accuracy of data pre-judgment based on a machine learning model.

Description

Data analysis method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of data analysis, and more particularly, to a data analysis method, a data analysis apparatus, a data analysis device, and a readable storage medium.
Background
With the development of technology, artificial intelligence is applied in more and more detection scenarios, such as computer security detection. A machine learning model is generally constructed from collected training samples and is then used to pre-judge the type of unknown data.
However, due to insufficient training, incorrect hyper-parameter settings, randomness of the training algorithm, and the like, a machine learning model often produces missed reports and false reports in practical application. At present, such data are generally handled by extensive manual intervention, such as manually compiling blacklists and whitelists, and each new missed or false report again requires manual intervention and processing. This is inefficient, generalizes poorly, and makes it difficult to guarantee the accuracy of data pre-judgment based on a machine learning model.
Disclosure of Invention
The invention mainly aims to provide a data analysis method, which aims to realize automatic processing of misreported data, thereby improving the accuracy of data pre-judgment based on a machine learning model.
In order to achieve the above object, the present invention provides a data analysis method, including the steps of:
analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;
determining an abnormal sample set according to a first sample with a difference between the prediction type and the actual type;
acquiring a sample at least comprising the abnormal sample set as a first sample set;
constructing a second analysis model by using the first sample set as a training sample;
and predicting the type of the sample to be detected according to the first analysis model and the second analysis model.
Optionally, after the step of analyzing the types of the plurality of first samples by using the first analysis model to obtain the predicted type of each first sample, obtaining the actual type of each first sample, the method further includes:
determining a normal sample set according to a first sample with the prediction type matched with the actual type;
the step of obtaining samples comprising at least the set of abnormal samples as a first set of samples comprises:
and selecting a part of samples from the normal sample set to be mixed with the abnormal sample set to obtain a first sample set.
Optionally, the step of selecting a part of samples from the normal sample set and mixing the part of samples with the abnormal sample set to obtain a first sample set includes:
based on a quantity balance rule, selecting a part of samples from the normal sample set to be mixed with the abnormal sample set to obtain the first sample set; the number balancing rule is that the difference value of the total amount of the samples corresponding to each actual type in the first sample set is smaller than or equal to a preset value.
Optionally, the step of selecting a part of samples from the normal sample set to be mixed with the abnormal sample set based on a quantity balancing rule to obtain a first sample set includes:
and randomly extracting a part of samples in the normal sample set and mixing with the abnormal sample set based on the quantity balancing rule to obtain the first sample set.
Optionally, the step of selecting a part of samples from the normal sample set to be mixed with the abnormal sample set based on a quantity balancing rule to obtain a first sample set includes:
defining each sample in the normal sample set as a first sample, defining a sample matched with the actual type of the first sample in the abnormal sample set as a second sample, and acquiring a first distance threshold corresponding to each actual type;
in the first samples corresponding to the actual types, determining the first samples of which Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold value as samples to be mixed;
and when the samples to be mixed corresponding to the actual types meet the quantity balance rule, mixing the samples in the abnormal sample set with the corresponding samples to be mixed according to the actual types to obtain the first sample set.
Optionally, after the step of determining, in the first samples corresponding to each of the actual types, the first samples whose Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold as the samples to be mixed, the method further includes:
judging whether each sample to be mixed meets the quantity balancing rule or not;
if so, executing the step of mixing the samples in the abnormal sample set with the corresponding samples to be mixed according to each actual type to obtain the first sample set;
if not, determining that the actual type corresponding to the sample to be mixed which does not meet the quantity balance rule is the target type;
adjusting a first distance threshold corresponding to the target type;
and returning to execute the step of determining, in the first samples corresponding to each actual type, the first samples whose Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold as samples to be mixed.
Optionally, the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model includes:
fusing the first analysis model and the second analysis model to obtain a third analysis model;
and analyzing the sample to be detected by adopting the first analysis model, the second analysis model and the third analysis model, and determining the type of the sample to be detected.
Optionally, the step of fusing the first analysis model and the second analysis model to obtain a third analysis model includes:
acquiring a training sample set corresponding to the first analysis model as a third sample set;
mixing the first sample set and the third sample set to obtain a fourth sample set;
performing feature extraction on the fourth sample set by using the first analysis model to obtain a first feature parameter, and performing feature extraction on the fourth sample set by using the second analysis model to obtain a second feature parameter;
taking the first characteristic parameter, the second characteristic parameter and the actual type corresponding to each sample in the fourth sample set as characteristic parameters;
and constructing the third analysis model by using the characteristic parameters as training samples.
Optionally, the step of analyzing the sample to be tested by using the first analysis model, the second analysis model, and the third analysis model, and determining the type of the sample to be tested includes:
performing feature extraction on the sample to be detected by adopting the first analysis model to obtain a third feature parameter, and performing feature extraction on the sample to be detected by adopting the second analysis model to obtain a fourth feature parameter;
and inputting the third characteristic parameter and the fourth characteristic parameter into the third analysis model, and taking a result output by the third analysis model as the type of the sample to be detected.
Optionally, the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model includes:
judging whether the Euclidean distances between each sample in the abnormal sample set and the sample to be detected are smaller than or equal to a second distance threshold;
if so, predicting the type of the sample to be detected by adopting the second analysis model;
if not, predicting the type of the sample to be detected by adopting the first analysis model.
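By way of illustration only, the distance-based choice between the two models might be sketched as follows. The function name `predict_with_routing`, the stand-in models and the sample values are hypothetical and are not taken from the application. The sketch reads "each sample" literally, so the second analysis model is used only when every abnormal sample lies within the second distance threshold; an implementation could equally test the nearest abnormal sample instead.

```python
import math

def predict_with_routing(x, abnormal_samples, second_threshold,
                         first_model, second_model):
    """Use the second model only when x is close to the abnormal sample set."""
    # Literal reading (assumption): the distance to EVERY abnormal sample
    # must be <= the second distance threshold.
    near_abnormal = bool(abnormal_samples) and all(
        math.dist(x, a) <= second_threshold for a in abnormal_samples
    )
    return second_model(x) if near_abnormal else first_model(x)

# Hypothetical stand-in models that only report which model answered.
first_model = lambda x: "from first model"
second_model = lambda x: "from second model"

abnormal_samples = [(0.45,), (0.55,)]
close = predict_with_routing((0.5,), abnormal_samples, 0.1,
                             first_model, second_model)
far = predict_with_routing((0.9,), abnormal_samples, 0.1,
                           first_model, second_model)
```

Here `close` is answered by the second model because both abnormal samples are within 0.1 of the input, while `far` falls back to the first model.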
In order to achieve the above object, the present application also proposes a data analysis device including:
the analysis module is used for analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;
the anomaly analysis module is used for determining an anomaly sample set according to a first sample with a difference between a prediction type and an actual type;
a sample set generating module, configured to obtain a sample at least including the abnormal sample set as a first sample set;
the modeling module is used for constructing a second analysis model by taking the first sample set as a training sample;
and the prediction module predicts the type of the sample to be detected according to the first analysis model and the second analysis model.
Further, in order to achieve the above object, the present application also proposes a data analysis apparatus comprising: a memory, a processor and a data analysis program stored on the memory and executable on the processor, the data analysis program when executed by the processor implementing the steps of the data analysis method as claimed in any one of the above.
In addition, in order to achieve the above object, the present application also proposes a readable storage medium having stored thereon a data analysis program which, when executed by a processor, implements the steps of the data analysis method as described in any one of the above.
The invention provides a data analysis method in which a first analysis model analyzes the types of a plurality of first samples to obtain the prediction type of each first sample, the actual type of each first sample is obtained, an abnormal sample set is determined from the first samples whose prediction type differs from the actual type, a second analysis model is constructed by using a first sample set that at least comprises the abnormal sample set as training samples, and the type of a sample to be tested is predicted according to the first analysis model and the second analysis model. In this method, the samples that the first analysis model predicted incorrectly are used directly, as at least part of the training samples, to construct a new second analysis model, and the type of the sample to be tested is predicted by combining the first analysis model and the second analysis model. The missed reports and false reports of the first analysis model are thus handled without manual intervention, and what is learned from them is applied to the prediction of new samples to be tested, which improves the accuracy of data prediction based on a machine learning model.
Drawings
FIG. 1 is a schematic structural diagram of a hardware operating environment related to a data analysis device according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating an embodiment of a data analysis method according to the present invention;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of a data analysis method according to the present invention;
FIG. 4 is a detailed flowchart of step S30a in FIG. 3;
FIG. 5 is a schematic flow chart diagram illustrating a data analysis method according to another embodiment of the present invention;
FIG. 6 is a flow chart illustrating a data analysis method according to another embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The main solution of the embodiment of the invention is as follows: analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample; determining an abnormal sample set according to a first sample with a difference between the prediction type and the actual type; acquiring a sample at least comprising the abnormal sample set as a first sample set; constructing a second analysis model by using the first sample set as a training sample; and predicting the type of the sample to be detected according to the first analysis model and the second analysis model.
In the prior art, a machine learning model often produces missed reports and false reports in practical application. At present, such data are generally handled by extensive manual intervention, such as manually compiling blacklists and whitelists, and when new missed or false reports occur, further manual intervention and processing are required. This is inefficient, generalizes poorly, and makes it difficult to guarantee the accuracy of data pre-judgment based on a machine learning model.
The invention provides a solution, aiming at realizing the automatic processing of the data which is not reported and is misreported, thereby improving the accuracy of data pre-judgment based on a machine learning model.
The embodiment of the invention provides data analysis equipment which is applied to predicting data types in any scene. For example, the method can be applied to computer security detection, and the data analysis device is applied to detect the security of the data so as to distinguish threatening data from non-threatening data in a scene relevant to computer operation.
As shown in fig. 1, the data analysis apparatus may include: a processor 1001 and a memory 1002. The memory 1002 is connected to the processor 1001. The memory 1002 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 1002 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1002, which is a kind of computer storage medium, may include therein a data analysis program. In the apparatus shown in fig. 1, the processor 1001 may be configured to call a data analysis program stored in the memory 1002 to perform the step operations of any of the following embodiments of the data analysis method.
Further, the embodiment of the invention also provides a data analysis method, which is applied to predicting the data type in any scene. For example, the method can be applied to computer security detection, and the data analysis device is applied to detect the security of the data so as to distinguish threatening data from non-threatening data in a scene relevant to computer operation.
In one embodiment, referring to fig. 2, the data analysis method includes:
step S10, analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;
The first analysis model is a machine learning model, obtained by collecting a large amount of sample data of different types and training on it, that is used to predict the type of data. The first analysis model may be any machine learning model, such as a deep neural network model. The parameters of the first analysis model comprise hyper-parameters and model parameters. The hyper-parameters are preset; a machine learning model with known hyper-parameters and unknown model parameters learns from a large number of collected training samples to determine the model parameters, and the resulting model, with known hyper-parameters and known model parameters, is used as the first analysis model.
And inputting a plurality of first samples of known actual types into the first analysis model, and obtaining an output result of the first analysis model to obtain the prediction type of each first sample. The actual type here refers to the actual data type of the first sample. The number of first samples should be as large as possible.
The actual type of the first sample may be obtained from feedback information generated while the first sample is in use, or from manually labeled information.
Step S20, determining an abnormal sample set according to the first sample with the difference between the prediction type and the actual type;
specifically, of the plurality of first samples, all the first samples having a difference between the prediction type and the actual type are used as the abnormal sample set.
Step S30, obtaining a sample at least comprising the abnormal sample set as a first sample set;
the abnormal sample set can be directly used as the first sample set, and all samples obtained by mixing the abnormal sample set with other specific sample sets can also be used as the first sample set. Wherein the actual type of each sample in the first set of samples is known. It should be noted that the amount of samples in the first set of samples is less than the total amount of the first samples.
Step S40, constructing a second analysis model by using the first sample set as a training sample;
and training a machine learning model with known hyper-parameters by adopting a first sample set with known actual types, and determining model parameters to obtain a second analysis model. The hyper-parameters of the second analysis model can be determined according to the hyper-parameters of the first analysis model, the hyper-parameters of the first analysis model can be used as the hyper-parameters of the second analysis model, and the hyper-parameters of the second analysis model can be obtained after the hyper-parameters of the first analysis model are corrected based on the characteristics of the abnormal sample set.
And step S50, predicting the type of the sample to be tested according to the first analysis model and the second analysis model.
Specifically, a fusion strategy of a first analysis model and a second analysis model for analyzing a sample to be detected is pre-specified, an analysis model of the sample to be detected is determined based on the fusion strategy and the sample to be detected, and the type of the sample to be detected is predicted by using the determined analysis model.
In this embodiment, a first analysis model analyzes the types of a plurality of first samples to obtain the prediction type of each first sample, the actual type of each first sample is obtained, an abnormal sample set is determined from the first samples whose prediction type differs from the actual type, a second analysis model is constructed by using a first sample set that at least comprises the abnormal sample set as training samples, and the type of the sample to be tested is predicted according to the first analysis model and the second analysis model. In this method, the samples that the first analysis model predicted incorrectly are used directly, as at least part of the training samples, to construct a new second analysis model, and the type of the sample to be tested is predicted by combining the first analysis model and the second analysis model. The missed reports and false reports of the first analysis model are thus handled without manual intervention, and what is learned from them is applied to the prediction of new samples to be tested, which improves the accuracy of data prediction based on a machine learning model.
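Steps S10 to S40 might be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the threshold-based `first_model`, the nearest-centroid trainer `train_nearest_centroid`, the type names "black" and "white" and all sample values are hypothetical stand-ins for whatever machine learning model and data are actually used.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, ...]
Labelled = Tuple[Sample, str]

def split_by_agreement(model: Callable[[Sample], str],
                       samples: List[Sample],
                       actual_types: List[str]):
    """S10/S20: predict each first sample and split by prediction vs. actual type."""
    abnormal: List[Labelled] = []
    normal: List[Labelled] = []
    for x, actual in zip(samples, actual_types):
        (normal if model(x) == actual else abnormal).append((x, actual))
    return abnormal, normal

def train_nearest_centroid(labelled: List[Labelled]) -> Callable[[Sample], str]:
    """S40 stand-in: a toy trainer that keeps one centroid per actual type."""
    grouped = {}
    for x, actual in labelled:
        grouped.setdefault(actual, []).append(x)
    centroids = {t: tuple(sum(col) / len(xs) for col in zip(*xs))
                 for t, xs in grouped.items()}

    def predict(x: Sample) -> str:
        return min(centroids,
                   key=lambda t: sum((a - b) ** 2 for a, b in zip(x, centroids[t])))
    return predict

# Hypothetical first analysis model: a fixed threshold on the first feature.
def first_model(x: Sample) -> str:
    return "black" if x[0] > 0.5 else "white"

samples = [(0.9,), (0.8,), (0.1,), (0.2,), (0.45,), (0.55,)]
actual_types = ["black", "black", "white", "white", "black", "white"]

abnormal, normal = split_by_agreement(first_model, samples, actual_types)
first_sample_set = abnormal          # S30: simplest case, abnormal set only
second_model = train_nearest_centroid(first_sample_set)
```

In this toy data the two samples near the decision boundary are mispredicted by `first_model`, so they form the abnormal sample set, and the second model trained on them classifies an input such as `(0.44,)` differently from the first model.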
Further, based on the above embodiments, another embodiment of the data analysis method of the present application is provided. In this embodiment, referring to fig. 3, after step S10, the method further includes:
step S20a, determining a normal sample set according to the first sample with the prediction type matched with the actual type;
specifically, of the plurality of first samples, all the first samples having the same prediction type as the actual type are taken as a normal sample set.
The sequence of the steps S20 and S20a is not limited in particular.
Step S30 includes:
and step S30a, selecting a part of samples in the normal sample set to be mixed with the abnormal sample set to obtain a first sample set.
The way of selecting part of the samples in the normal sample set can be selected according to actual requirements.
In this embodiment, the second analysis model is trained on a sample set obtained by mixing part of the samples in the normal sample set with the abnormal sample set. The training remains centered on the abnormal samples, but the overall number of samples increases, which improves the accuracy of the resulting second analysis model.
Specifically, in order to further improve the accuracy of the second analysis model in analyzing different types of data, step S30a specifically includes: step S31, based on quantity balance rule, selecting partial samples in the normal sample set to mix with the abnormal sample set to obtain the first sample set; the number balancing rule is that the difference value of the total amount of the samples corresponding to each actual type in the first sample set is smaller than or equal to a preset value. Specifically, no matter how many samples corresponding to each actual type in the abnormal sample set exist, after the samples in the abnormal sample set and the samples in the normal sample set are correspondingly mixed according to the actual types, the sum of the number of the abnormal samples corresponding to each actual type in the obtained first sample set is approximately equal to the sum of the number of the normal samples. The preset value can be set according to the actual prediction accuracy requirement. Preferably, the total amount of samples corresponding to each actual type in the first sample set is equal, i.e. the difference between the total amounts of samples corresponding to the actual types is 0.
Based on this, step S31 may specifically include: and randomly extracting a part of samples in the normal sample set and mixing with the abnormal sample set based on the quantity balancing rule to obtain the first sample set. That is, the samples are randomly extracted from the normal sample set and mixed with the abnormal sample set until the difference between the total amount of the samples corresponding to each actual type in the first sample set is smaller than or equal to the preset value.
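The random extraction under the quantity balancing rule might be sketched as follows. The function name `balanced_random_mix` and the parameter `target_per_type` (the per-type total to aim for) are assumptions introduced for illustration; the application does not specify how the target is chosen. The sketch keeps every abnormal sample, tops each actual type up with randomly drawn normal samples, and then checks the rule against the preset value.

```python
import random
from collections import Counter, defaultdict

def balanced_random_mix(abnormal, normal, target_per_type, preset=0, seed=0):
    """Randomly draw normal samples so that per-type totals differ by <= preset."""
    rng = random.Random(seed)
    normal_by_type = defaultdict(list)
    for pair in normal:
        normal_by_type[pair[1]].append(pair)
    abnormal_counts = Counter(actual for _, actual in abnormal)

    first_set = list(abnormal)  # every abnormal sample is kept
    for t in sorted(set(normal_by_type) | set(abnormal_counts)):
        need = max(0, target_per_type - abnormal_counts[t])
        pool = normal_by_type[t]
        first_set.extend(rng.sample(pool, min(need, len(pool))))

    # Quantity balancing rule: difference of per-type totals <= preset value.
    totals = Counter(actual for _, actual in first_set)
    if max(totals.values()) - min(totals.values()) > preset:
        raise ValueError("quantity balancing rule not met")
    return first_set

abnormal = [((0.45,), "black"), ((0.55,), "white"), ((0.52,), "white")]
normal = [((0.9,), "black"), ((0.8,), "black"), ((0.7,), "black"),
          ((0.1,), "white"), ((0.2,), "white"), ((0.3,), "white")]

first_set = balanced_random_mix(abnormal, normal, target_per_type=3)
totals = Counter(actual for _, actual in first_set)
```

With a preset value of 0, as in the preferred case described above, each actual type ends up with the same total number of samples.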
In other embodiments, referring to fig. 4, step S31 may further specifically include:
step S311, defining each sample in the normal sample set as a first sample, defining a sample in the abnormal sample set, which is matched with an actual type of the first sample, as a second sample, and obtaining a first distance threshold corresponding to each actual type;
The first distance threshold is specifically a limit on the Euclidean distance between an extracted sample and the abnormal samples when samples are extracted from the normal sample set based on Euclidean distance. Different actual types correspond to different first distance thresholds. The first distance threshold may be determined according to the sample size of the abnormal sample set, the distribution characteristics of the samples in the normal sample set, and the like.
Step S312, in the first samples corresponding to each actual type, determining the first samples whose Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold as samples to be mixed;
The first samples of the same actual type are grouped into one second sample set, yielding a plurality of second sample sets. A first sample is taken from each second sample set; when the Euclidean distance between that first sample and each second sample of the same actual type is smaller than or equal to the first distance threshold, the first sample is highly similar to the samples of the corresponding type in the abnormal sample set and can therefore be used as a sample to be mixed.
And when the to-be-mixed samples corresponding to the actual types all meet the quantity balancing rule, executing step S313, and mixing the samples in the abnormal sample set with the corresponding to-be-mixed samples according to the actual types to obtain the first sample set.
Specifically, when the samples to be mixed all meet the quantity balance rule, all samples in the abnormal sample set with the same actual type and the corresponding samples to be mixed are mixed to obtain a first sample set.
In this embodiment, samples are extracted from the normal sample set and mixed with the abnormal sample set according to the steps S311 to S313, so that all the samples in the first sample set are samples with a high probability of false alarm and missing report, and therefore, the second analysis model obtained based on the training of the first sample set can accurately identify the samples with false alarm and missing report, and the accuracy of the prediction result can be further improved when the type of the sample to be detected is predicted based on the first analysis model and the second analysis model.
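Steps S311 and S312 might be sketched as follows. The function name `select_to_mix`, the dictionary of per-type thresholds and the sample values are hypothetical. The sketch follows the wording literally, so a normal sample qualifies only if it is within the first distance threshold of every abnormal sample of the same actual type.

```python
import math

def select_to_mix(normal, abnormal, thresholds):
    """Per actual type, keep normal samples close to all abnormal samples of that type."""
    to_mix = {}
    for t, limit in thresholds.items():
        # "Second samples": abnormal samples whose actual type is t.
        seconds = [x for x, actual in abnormal if actual == t]
        to_mix[t] = [
            (x, actual) for x, actual in normal
            if actual == t and seconds
            and all(math.dist(x, s) <= limit for s in seconds)
        ]
    return to_mix

normal = [((0.9,), "black"), ((0.6,), "black"),
          ((0.1,), "white"), ((0.4,), "white")]
abnormal = [((0.45,), "black"), ((0.55,), "white")]
thresholds = {"black": 0.2, "white": 0.2}   # one first distance threshold per type

to_mix = select_to_mix(normal, abnormal, thresholds)
# S313: mix the abnormal samples with the selected samples to be mixed.
first_sample_set = abnormal + [p for pairs in to_mix.values() for p in pairs]
```

Only the normal samples that lie near the mispredicted ones are selected, which matches the stated aim that the first sample set concentrates on samples with a high probability of false or missed reports.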
Further, referring to fig. 4, before step S313, the method further includes:
step S31a, judging whether each sample to be mixed meets the quantity balance rule;
if yes, go to step S313; if not, the process returns to step S312 after steps S314 and S315 are executed.
Step S314, determining the actual type corresponding to the sample to be mixed which does not meet the quantity balance rule as the target type;
step S315, adjusting a first distance threshold corresponding to the target type.
Wherein the adjustment of the first distance threshold may be increased or decreased by a predetermined magnitude. In addition, in order to ensure the accuracy of the determined first distance threshold and improve the efficiency of generating the first sample set, the first distance threshold may be determined according to the difference between the total sample distance corresponding to the target type and the total sample distance satisfying the quantity balancing rule.
Here, when a certain type of sample to be mixed does not satisfy the quantity balance rule, the first distance threshold is adjusted, so that the situation that the balance of various types of samples in the first sample set is influenced due to unreasonable setting of the first distance threshold is avoided, the accuracy of the second analysis model is effectively guaranteed, and the prediction precision of the sample to be measured is further improved.
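The loop formed by steps S312, S31a, S314 and S315 might be sketched as follows. This is a simplified illustration: the names `tune_thresholds`, `target_total`, `tolerance` and `step` are assumptions, and the quantity balancing rule is approximated by requiring each type's total (abnormal samples plus selected normal samples) to lie within `tolerance` of a common `target_total`. The threshold of a target type is widened when too few samples are selected and narrowed when too many are selected.

```python
import math
from collections import Counter

def count_selected(normal, abnormal, t, limit):
    """S312: number of normal samples of type t within `limit` of all abnormal ones."""
    seconds = [x for x, actual in abnormal if actual == t]
    return sum(1 for x, actual in normal
               if actual == t and seconds
               and all(math.dist(x, s) <= limit for s in seconds))

def tune_thresholds(normal, abnormal, thresholds, target_total,
                    tolerance=0, step=0.1, max_rounds=100):
    thresholds = dict(thresholds)
    abnormal_counts = Counter(actual for _, actual in abnormal)
    for _ in range(max_rounds):
        # S31a / S314: find the target types that break the (simplified) rule.
        off_balance = []
        for t, limit in thresholds.items():
            total = abnormal_counts[t] + count_selected(normal, abnormal, t, limit)
            if abs(total - target_total) > tolerance:
                off_balance.append((t, total))
        if not off_balance:
            return thresholds
        # S315: adjust by a fixed magnitude, then return to S312.
        for t, total in off_balance:
            delta = step if total < target_total else -step
            thresholds[t] = max(0.0, thresholds[t] + delta)
    raise RuntimeError("no balanced thresholds found within max_rounds")

normal = [((0.6,), "black"), ((0.7,), "black"), ((1.0,), "black"),
          ((0.4,), "white"), ((0.3,), "white"), ((0.0,), "white")]
abnormal = [((0.4,), "black"), ((0.6,), "white")]

tuned = tune_thresholds(normal, abnormal, {"black": 0.05, "white": 0.05},
                        target_total=3)
```

A fixed step can oscillate around the target, which is why the sketch bounds the number of rounds; deriving the adjustment from the size of the shortfall, as the description suggests, would converge faster.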
Further, based on any of the above embodiments, another embodiment of the data analysis method of the present application is provided. In the present embodiment, referring to fig. 5, step S50 includes:
step S51, fusing the first analysis model and the second analysis model to obtain a third analysis model;
and fusing the first analysis model and the second analysis model according to a model fusion strategy to obtain a third analysis model. The model fusion strategy can be selected according to actual requirements.
Specifically, step S51 includes:
step S511, obtaining a training sample set corresponding to the first analysis model as a third sample set;
the third sample set is specifically a training sample for constructing the first analysis model.
Step S512, mixing the first sample set and the third sample set to obtain a fourth sample set;
step S513, performing feature extraction on the fourth sample set by using the first analysis model to obtain a first feature parameter, and performing feature extraction on the fourth sample set by using the second analysis model to obtain a second feature parameter;
The first characteristic parameter and the second characteristic parameter are the probability values, obtained after the corresponding analysis model analyzes the samples in the fourth sample set, that each sample belongs to each preset type. For example, when the preset types include a black sample and a white sample, after the first analysis model analyzes one sample in the fourth sample set, the probability that the sample is a black sample and the probability that it is a white sample are obtained as the first characteristic parameter; after the second analysis model analyzes the same sample, the probability that the sample is a black sample and the probability that it is a white sample are obtained as the second characteristic parameter.
Step S514, using the first characteristic parameter, the second characteristic parameter and the actual type corresponding to each sample in the fourth sample set as characteristic parameters;
and step S515, constructing the third analysis model by using the characteristic parameters as training samples.
Step S52, analyzing the sample to be tested by using the first analysis model, the second analysis model, and the third analysis model, and determining the type of the sample to be tested.
Specifically, the sample to be tested can be analyzed by respectively adopting the first analysis model, the second analysis model and the third analysis model, and comprehensive analysis is performed based on analysis results respectively obtained by the three models to obtain the type of the sample to be tested.
In addition, the analysis result of any one or two analysis models can be used as the input data of the rest analysis models, and the output result of the rest analysis models can be used as the type of the sample to be detected. Specifically, based on the steps S511 to S515, the step S52 may specifically include: performing feature extraction on the sample to be detected by adopting the first analysis model to obtain a third feature parameter, and performing feature extraction on the sample to be detected by adopting the second analysis model to obtain a fourth feature parameter; and inputting the third characteristic parameter and the fourth characteristic parameter into the third analysis model, and taking a result output by the third analysis model as the type of the sample to be detected. The extraction manner and concept of the third feature parameter and the fourth feature parameter are similar to those of the first feature parameter and the second feature parameter, and are not described herein again.
In this embodiment, a third analysis model is generated based on the first analysis model and the second analysis model, and the three analysis models are combined to analyze the sample to be detected, so that the accuracy of the prediction result is improved. When the third analysis model is generated, the first analysis model and the second analysis model are respectively adopted to extract the characteristics of the combined training sample set, and the extracted characteristics are combined with the actual type of the sample to train the third analysis model, so that the third analysis model can accurately identify the possibility of false alarm and missed alarm of all samples in different types. Based on the method, when the type of the sample to be detected is predicted, the first analysis model and the second analysis model are respectively adopted to perform feature extraction on the sample to be detected and then input into the third analysis model to perform type prediction, so that the accuracy of the type of the obtained sample to be detected is ensured.
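As an illustration of steps S511 to S515 and of the prediction flow built on them, the following sketch trains a third model on the per-type probabilities output by the two base models. Everything concrete here is an assumption made for the example: the two toy base models, the encoding of the actual type as the training label (1 for black, 0 for white), and the use of a small logistic regression as the third analysis model, whose form the text leaves open.

```python
import math

def stack_features(model_1, model_2, sample):
    """Concatenate the per-type probabilities output by both base models."""
    return list(model_1(sample)) + list(model_2(sample))

def train_third_model(features, labels, epochs=200, lr=1.0):
    """Fit a tiny logistic regression by batch gradient descent.
    This is a stand-in meta-learner, not the model used in the source."""
    n = len(features[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(features, labels):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / len(features) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(features)
    # Returns 1 (black) or 0 (white) for a stacked feature vector.
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0

# Hypothetical base models returning (P(black), P(white)).
def model_1(s):  # toy first analysis model: looks only at feature 0
    p = 0.9 if s[0] > 0.5 else 0.1
    return (p, 1 - p)

def model_2(s):  # toy second analysis model: looks only at feature 1
    p = 0.8 if s[1] > 0.5 else 0.2
    return (p, 1 - p)

# Fourth sample set (first sample set mixed with the third sample set).
fourth_set = [(1, 1), (0, 0), (1, 0), (0, 1)]
actual_types = [1, 0, 0, 1]  # 1 = black, 0 = white

features = [stack_features(model_1, model_2, s) for s in fourth_set]
third_model = train_third_model(features, actual_types)

# Prediction: extract features with both base models, then ask the third model.
predicted = third_model(stack_features(model_1, model_2, (1, 0)))
```

In the toy data the first model misjudges the samples on which the two base models disagree, so the third model learns to weight the second model's probabilities more heavily for them.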
Further, based on any of the above embodiments, a further embodiment of the data analysis method of the present application is provided. Referring to fig. 6, step S50 includes:
step S501, determining whether the Euclidean distances between each sample in the abnormal sample set and the sample to be detected are all less than or equal to a second distance threshold; if so, executing step S502; if not, executing step S503.
The second distance threshold may be specifically determined according to the actual sample size of each type of sample in the abnormal sample set, the characteristic condition of each sample, the prediction accuracy requirement, and the like.
When the Euclidean distances between the sample to be detected and every sample in the abnormal sample set are all less than or equal to the second distance threshold, the first analysis model is likely to produce a false alarm or a missed alarm for this sample, so the second analysis model is used to predict the type of the sample to be detected. When the distances are not all less than or equal to the second distance threshold, the first analysis model is unlikely to produce a false alarm or a missed alarm, so the first analysis model can be used while still ensuring the accuracy of the predicted type. To further improve the accuracy of the analysis, when the distances are not all within the second distance threshold, the number of samples in the abnormal sample set whose Euclidean distance to the sample to be detected is less than or equal to the second distance threshold may also be taken into account: the sample to be detected is analyzed with the first analysis model when this additional condition is satisfied, and with the second analysis model otherwise.
Step S502, predicting the type of the sample to be detected by adopting the second analysis model;
step S503, predicting the type of the sample to be detected by adopting the first analysis model.
In this embodiment, the likelihood that the first analysis model produces a false alarm or a missed alarm for the sample to be detected is assessed based on the Euclidean distance. When the likelihood is high, the second analysis model is used to predict the type of the sample to be detected; otherwise, the first analysis model is used. This effectively avoids false alarms and missed alarms when predicting the type of the sample to be detected and improves the accuracy of type prediction based on machine learning models.
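The routing of steps S501 to S503 can be sketched as below. The function names and the toy models are hypothetical, and sending the sample to the first analysis model when the abnormal sample set is empty is an assumption added for the example, since the text does not cover that case.

```python
import math

def choose_model(sample, abnormal_set, second_threshold):
    """Return 'second' when every abnormal sample lies within the second
    distance threshold of the sample (S502), otherwise 'first' (S503)."""
    all_within = bool(abnormal_set) and all(
        math.dist(sample, a) <= second_threshold for a in abnormal_set)
    return "second" if all_within else "first"

def predict_type(sample, abnormal_set, second_threshold,
                 first_model, second_model):
    """Predict the type with whichever model the distance check selects."""
    if choose_model(sample, abnormal_set, second_threshold) == "second":
        return second_model(sample)
    return first_model(sample)

# Toy usage with hypothetical models that always return one type.
abnormal_set = [(0.0, 0.0), (1.0, 0.0)]
first_model = lambda s: "white"
second_model = lambda s: "black"

near = predict_type((0.5, 0.0), abnormal_set, 1.5, first_model, second_model)
far = predict_type((5.0, 0.0), abnormal_set, 1.5, first_model, second_model)
```

The check costs one distance computation per abnormal sample for each prediction, so for a large abnormal sample set a spatial index may be worth considering.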
Furthermore, the present invention also provides a data analysis apparatus including:
the analysis module is used for analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;
the anomaly analysis module is used for determining an anomaly sample set according to a first sample with a difference between a prediction type and an actual type;
a sample set generating module, configured to obtain a sample at least including the abnormal sample set as a first sample set;
the modeling module is used for constructing a second analysis model by taking the first sample set as a training sample;
and the prediction module predicts the type of the sample to be detected according to the first analysis model and the second analysis model.
In this embodiment, specific steps executed by various modules may refer to the step operations corresponding to any embodiment of the data analysis method, which are not described herein again.
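The cooperation of the analysis, anomaly analysis, sample set generating and modeling modules can be illustrated with a short sketch. It is written under assumptions: `build_second_model`, the `train_fn` callback and the nearest-neighbour trainer are hypothetical names introduced for the example, and the first sample set is taken to be exactly the abnormal sample set, which is the minimum the text requires.

```python
import math

def build_second_model(first_model, samples, actual_types, train_fn):
    """Collect the samples the first model misjudges and train on them."""
    # Analysis module: prediction type of each first sample.
    predicted = [first_model(s) for s in samples]
    # Anomaly analysis module: prediction type differs from actual type.
    abnormal = [(s, y) for s, p, y in zip(samples, predicted, actual_types)
                if p != y]
    # Sample set generating module: here, only the abnormal sample set.
    first_sample_set = list(abnormal)
    # Modeling module: train_fn is a caller-supplied (hypothetical) trainer.
    return abnormal, train_fn(first_sample_set)

def nearest_neighbour_trainer(labelled):
    """Toy trainer: predict the type of the closest training sample."""
    return lambda s: min(labelled, key=lambda item: math.dist(s, item[0]))[1]

# Toy first analysis model: a fixed cut on a single feature.
first_model = lambda s: "white" if s[0] < 5 else "black"
samples = [(1,), (2,), (8,), (4,)]
actual_types = ["white", "white", "black", "black"]

abnormal_set, second_model = build_second_model(
    first_model, samples, actual_types, nearest_neighbour_trainer)
```

The prediction module would then combine `first_model` and `second_model`, for example through the fusion or distance-based routing described in the method embodiments.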
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a data analysis program is stored on the computer-readable storage medium, and when the data analysis program is executed by a processor, the data analysis program implements the step operations of any embodiment of the data analysis method.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (13)

1. A data analysis method, characterized in that the data analysis method comprises the steps of:
analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;
determining an abnormal sample set according to a first sample with a difference between the prediction type and the actual type;
acquiring a sample at least comprising the abnormal sample set as a first sample set;
constructing a second analysis model by using the first sample set as a training sample;
and predicting the type of the sample to be detected according to the first analysis model and the second analysis model.
2. The data analysis method of claim 1, wherein the step of analyzing the types of the plurality of first samples by using the first analysis model to obtain the predicted type of each first sample, and obtaining the actual type of each first sample further comprises:
determining a normal sample set according to a first sample with the prediction type matched with the actual type;
the step of obtaining samples comprising at least the set of abnormal samples as a first set of samples comprises:
and selecting a part of samples from the normal sample set to be mixed with the abnormal sample set to obtain a first sample set.
3. The data analysis method of claim 2, wherein the step of selecting a portion of the samples from the normal sample set to be mixed with the abnormal sample set to obtain the first sample set comprises:
based on a quantity balance rule, selecting a part of samples from the normal sample set to be mixed with the abnormal sample set to obtain the first sample set; the quantity balance rule is that the difference between the total numbers of samples corresponding to the respective actual types in the first sample set is less than or equal to a preset value.
4. The data analysis method of claim 3, wherein the step of selecting a portion of the samples in the normal sample set to be mixed with the abnormal sample set based on the quantity balance rule to obtain the first sample set comprises:
and randomly extracting a part of the samples in the normal sample set and mixing them with the abnormal sample set based on the quantity balance rule to obtain the first sample set.
5. The data analysis method of claim 3, wherein the step of selecting a portion of the samples in the normal sample set to be mixed with the abnormal sample set based on the quantity balance rule to obtain the first sample set comprises:
defining each sample in the normal sample set as a first sample, defining a sample matched with the actual type of the first sample in the abnormal sample set as a second sample, and acquiring a first distance threshold corresponding to each actual type;
in the first samples corresponding to the actual types, determining the first samples of which Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold value as samples to be mixed;
and when the samples to be mixed corresponding to the actual types meet the quantity balance rule, mixing the samples in the abnormal sample set with the corresponding samples to be mixed according to the actual types to obtain the first sample set.
6. The data analysis method of claim 5, wherein, in the first samples corresponding to each of the actual types, the step of determining, as the samples to be mixed, the first samples having the euclidean distances from each corresponding second sample smaller than or equal to the first distance threshold further comprises:
judging whether each sample to be mixed meets the quantity balancing rule or not;
if so, executing the step of mixing the samples in the abnormal sample set with the corresponding samples to be mixed according to each actual type to obtain the first sample set;
if not, determining that the actual type corresponding to the sample to be mixed which does not meet the quantity balance rule is the target type;
adjusting a first distance threshold corresponding to the target type;
and returning to the step of determining, among the first samples corresponding to each actual type, the first samples whose Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold as samples to be mixed.
7. The data analysis method of any one of claims 1 to 6, wherein the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model comprises:
fusing the first analysis model and the second analysis model to obtain a third analysis model;
and analyzing the sample to be detected by adopting the first analysis model, the second analysis model and the third analysis model, and determining the type of the sample to be detected.
8. The data analysis method of claim 7, wherein the step of fusing the first analytical model and the second analytical model to obtain a third analytical model comprises:
acquiring a training sample set corresponding to the first analysis model as a third sample set;
mixing the first sample set and the third sample set to obtain a fourth sample set;
performing feature extraction on the fourth sample set by using the first analysis model to obtain a first feature parameter, and performing feature extraction on the fourth sample set by using the second analysis model to obtain a second feature parameter;
taking the first characteristic parameter, the second characteristic parameter and the actual type corresponding to each sample in the fourth sample set as characteristic parameters;
and constructing the third analysis model by using the characteristic parameters as training samples.
9. The data analysis method of claim 8, wherein the step of analyzing the sample to be tested by using the first analysis model, the second analysis model and the third analysis model to determine the type of the sample to be tested comprises:
performing feature extraction on the sample to be detected by adopting the first analysis model to obtain a third feature parameter, and performing feature extraction on the sample to be detected by adopting the second analysis model to obtain a fourth feature parameter;
and inputting the third characteristic parameter and the fourth characteristic parameter into the third analysis model, and taking a result output by the third analysis model as the type of the sample to be detected.
10. The data analysis method of any one of claims 1 to 6, wherein the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model comprises:
judging whether the Euclidean distances between each sample in the abnormal sample set and the sample to be detected are all smaller than or equal to a second distance threshold;
if so, predicting the type of the sample to be detected by adopting the second analysis model;
if not, predicting the type of the sample to be detected by adopting the first analysis model.
11. A data analysis apparatus, characterized in that the data analysis apparatus comprises:
the analysis module is used for analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;
the anomaly analysis module is used for determining an anomaly sample set according to a first sample with a difference between a prediction type and an actual type;
a sample set generating module, configured to obtain a sample at least including the abnormal sample set as a first sample set;
the modeling module is used for constructing a second analysis model by taking the first sample set as a training sample;
and the prediction module predicts the type of the sample to be detected according to the first analysis model and the second analysis model.
12. A data analysis apparatus, characterized in that the data analysis apparatus comprises: memory, a processor and a data analysis program stored on the memory and executable on the processor, the data analysis program when executed by the processor implementing the steps of the data analysis method according to any one of claims 1 to 10.
13. A readable storage medium, characterized in that the readable storage medium has stored thereon a data analysis program which, when executed by a processor, implements the steps of the data analysis method according to any one of claims 1 to 10.
CN202010313979.3A 2020-04-20 2020-04-20 Data analysis method, device, equipment and readable storage medium Pending CN113537262A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313979.3A CN113537262A (en) 2020-04-20 2020-04-20 Data analysis method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010313979.3A CN113537262A (en) 2020-04-20 2020-04-20 Data analysis method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113537262A true CN113537262A (en) 2021-10-22

Family

ID=78123704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010313979.3A Pending CN113537262A (en) 2020-04-20 2020-04-20 Data analysis method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113537262A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061322A1 (en) * 2015-08-31 2017-03-02 International Business Machines Corporation Automatic generation of training data for anomaly detection using other user's data samples
US20180046934A1 (en) * 2016-08-09 2018-02-15 International Business Machines Corporation Warning filter based on machine learning
CN108111489A (en) * 2017-12-07 2018-06-01 阿里巴巴集团控股有限公司 URL attack detection methods, device and electronic equipment
US20190026466A1 (en) * 2017-07-24 2019-01-24 Crowdstrike, Inc. Malware detection using local computational models
CN109492395A (en) * 2018-10-31 2019-03-19 厦门安胜网络科技有限公司 A kind of method, apparatus and storage medium detecting rogue program
CN110084271A (en) * 2019-03-22 2019-08-02 同盾控股有限公司 A kind of other recognition methods of picture category and device
CN110472410A (en) * 2018-05-11 2019-11-19 阿里巴巴集团控股有限公司 Identify method, equipment and the data processing method of data
WO2020046575A1 (en) * 2018-08-31 2020-03-05 Sophos Limited Enterprise network threat detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination