CN113537262A

CN113537262A - Data analysis method, device, equipment and readable storage medium

Info

Publication number: CN113537262A
Application number: CN202010313979.3A
Authority: CN
Inventors: 刘彦南; 郑坚秋
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2020-04-20
Filing date: 2020-04-20
Publication date: 2021-10-22

Abstract

The invention discloses a data analysis method, which comprises the following steps: analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample; determining an abnormal sample set according to a first sample with a difference between the prediction type and the actual type; obtaining a sample at least comprising an abnormal sample set as a first sample set; constructing a second analysis model by using the first sample set as a training sample; and predicting the type of the sample to be detected according to the first analysis model and the second analysis model. The invention also discloses a data analysis device, data analysis equipment and a readable storage medium. The invention aims to realize the automatic processing of the missing and false alarm data, thereby improving the accuracy of data pre-judgment based on a machine learning model.

Description

Data analysis method, device, equipment and readable storage medium

Technical Field

The present invention relates to the field of data analysis, and more particularly, to a data analysis method, a data analysis apparatus, a data analysis device, and a readable storage medium.

Background

With the development of the technology, the artificial intelligence technology is applied to more and more detection scenes, such as computer security detection scenes, and a machine learning model is generally constructed by collecting training samples, and the machine learning model is adopted to pre-judge the type of unknown data.

However, because of insufficient training, incorrect hyper-parameter settings, randomness in the training algorithm, and similar factors, a machine learning model often produces missed reports (false negatives) and false reports (false positives) in actual use. At present, such data is generally handled through heavy manual intervention, for example by manually building black and white lists, and each new missed or false report again requires manual intervention and processing. This is inefficient, generalizes poorly, and makes it difficult to guarantee the accuracy of data pre-judgment based on a machine learning model.

Disclosure of Invention

The main object of the invention is to provide a data analysis method that automatically processes missed-report and false-report data, thereby improving the accuracy of data pre-judgment based on a machine learning model.

In order to achieve the above object, the present invention provides a data analysis method, including the steps of:

analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;

determining an abnormal sample set according to a first sample with a difference between the prediction type and the actual type;

acquiring a sample at least comprising the abnormal sample set as a first sample set;

constructing a second analysis model by using the first sample set as a training sample;

and predicting the type of the sample to be detected according to the first analysis model and the second analysis model.

Optionally, after the step of analyzing the types of the plurality of first samples by using the first analysis model to obtain the predicted type of each first sample, obtaining the actual type of each first sample, the method further includes:

determining a normal sample set according to a first sample with the prediction type matched with the actual type;

the step of obtaining samples comprising at least the set of abnormal samples as a first set of samples comprises:

and selecting a part of samples from the normal sample set to be mixed with the abnormal sample set to obtain a first sample set.

Optionally, the step of selecting a part of samples from the normal sample set and mixing the part of samples with the abnormal sample set to obtain a first sample set includes:

based on a quantity balance rule, selecting a part of samples from the normal sample set to be mixed with the abnormal sample set to obtain the first sample set; the number balancing rule is that the difference value of the total amount of the samples corresponding to each actual type in the first sample set is smaller than or equal to a preset value.

Optionally, the step of selecting a part of samples from the normal sample set to be mixed with the abnormal sample set based on a quantity balancing rule to obtain a first sample set includes:

and randomly extracting a part of samples in the normal sample set and mixing with the abnormal sample set based on the quantity balancing rule to obtain the first sample set.

Optionally, the step of selecting a part of samples from the normal sample set to be mixed with the abnormal sample set based on a quantity balancing rule to obtain a first sample set includes:

defining each sample in the normal sample set as a first sample, defining a sample matched with the actual type of the first sample in the abnormal sample set as a second sample, and acquiring a first distance threshold corresponding to each actual type;

in the first samples corresponding to the actual types, determining the first samples of which Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold value as samples to be mixed;

and when the samples to be mixed corresponding to the actual types meet the quantity balance rule, mixing the samples in the abnormal sample set with the corresponding samples to be mixed according to the actual types to obtain the first sample set.

Optionally, after the step of determining, in the first samples corresponding to each of the actual types, the first samples whose euclidean distances to each corresponding second sample are less than or equal to the first distance threshold as the samples to be mixed, the method further includes:

judging whether each sample to be mixed meets the quantity balancing rule or not;

if so, executing the step of mixing the samples in the abnormal sample set with the corresponding samples to be mixed according to each actual type to obtain the first sample set;

if not, determining that the actual type corresponding to the sample to be mixed which does not meet the quantity balance rule is the target type;

adjusting a first distance threshold corresponding to the target type;

and returning to the step of determining, among the first samples corresponding to each actual type, the first samples whose Euclidean distances to each corresponding second sample are less than or equal to the first distance threshold as samples to be mixed.

Optionally, the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model includes:

fusing the first analysis model and the second analysis model to obtain a third analysis model;

and analyzing the sample to be detected by adopting the first analysis model, the second analysis model and the third analysis model, and determining the type of the sample to be detected.

Optionally, the step of fusing the first analysis model and the second analysis model to obtain a third analysis model includes:

acquiring a training sample set corresponding to the first analysis model as a third sample set;

mixing the first sample set and the third sample set to obtain a fourth sample set;

performing feature extraction on the fourth sample set by using the first analysis model to obtain a first feature parameter, and performing feature extraction on the fourth sample set by using the second analysis model to obtain a second feature parameter;

taking the first characteristic parameter, the second characteristic parameter and the actual type corresponding to each sample in the fourth sample set as characteristic parameters;

and constructing the third analysis model by using the characteristic parameters as training samples.

Optionally, the step of analyzing the sample to be tested by using the first analysis model, the second analysis model, and the third analysis model, and determining the type of the sample to be tested includes:

performing feature extraction on the sample to be detected by adopting the first analysis model to obtain a third feature parameter, and performing feature extraction on the sample to be detected by adopting the second analysis model to obtain a fourth feature parameter;

and inputting the third characteristic parameter and the fourth characteristic parameter into the third analysis model, and taking a result output by the third analysis model as the type of the sample to be detected.

Optionally, the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model includes:

judging whether Euclidean distances between each sample in the abnormal sample set and the sample to be detected are smaller than or equal to the second distance threshold value;

if so, predicting the type of the sample to be detected by adopting the second analysis model;

if not, predicting the type of the sample to be detected by adopting the first analysis model.

In order to achieve the above object, the present application also proposes a data analysis device including:

the analysis module is used for analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;

the anomaly analysis module is used for determining an anomaly sample set according to a first sample with a difference between a prediction type and an actual type;

a sample set generating module, configured to obtain a sample at least including the abnormal sample set as a first sample set;

the modeling module is used for constructing a second analysis model by taking the first sample set as a training sample;

and a prediction module, configured to predict the type of the sample to be detected according to the first analysis model and the second analysis model.

Further, in order to achieve the above object, the present application also proposes a data analysis apparatus comprising: a memory, a processor and a data analysis program stored on the memory and executable on the processor, the data analysis program when executed by the processor implementing the steps of the data analysis method as claimed in any one of the above.

In addition, in order to achieve the above object, the present application also proposes a readable storage medium having stored thereon a data analysis program which, when executed by a processor, implements the steps of the data analysis method as described in any one of the above.

The invention provides a data analysis method in which a first analysis model analyzes the types of a plurality of first samples to obtain the prediction type of each first sample, the actual type of each first sample is obtained, an abnormal sample set is determined from the first samples whose prediction type differs from their actual type, a second analysis model is constructed using a first sample set that at least comprises the abnormal sample set as training samples, and the type of a sample to be tested is predicted according to the first analysis model and the second analysis model. In this way, the abnormal samples in the prediction results of the first analysis model are used directly, as at least part of the training samples, to construct a new second analysis model, and the type of the sample to be tested is predicted by combining the first analysis model and the second analysis model. The missed reports and false reports of the first analysis model are thus handled without manual intervention, and what is learned from them is applied to the prediction of new samples to be tested, so that the accuracy of data prediction based on a machine learning model is ensured.

Drawings

FIG. 1 is a schematic structural diagram of a hardware operating environment related to a data analysis device according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart diagram illustrating an embodiment of a data analysis method according to the present invention;

FIG. 3 is a schematic flow chart diagram illustrating another embodiment of a data analysis method according to the present invention;

FIG. 4 is a detailed flowchart of step S30a in FIG. 3;

FIG. 5 is a schematic flow chart diagram illustrating a data analysis method according to another embodiment of the present invention;

FIG. 6 is a flow chart illustrating a data analysis method according to another embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The main solution of the embodiment of the invention is as follows: analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample; determining an abnormal sample set according to a first sample with a difference between the prediction type and the actual type; acquiring a sample at least comprising the abnormal sample set as a first sample set; constructing a second analysis model by using the first sample set as a training sample; and predicting the type of the sample to be detected according to the first analysis model and the second analysis model.

In the prior art, a machine learning model often produces missed reports and false reports in actual use. Such data is currently handled through heavy manual intervention, for example by manually building black and white lists, and each new missed or false report again requires manual intervention and processing. This is inefficient, generalizes poorly, and makes it difficult to guarantee the accuracy of data pre-judgment based on the machine learning model.

The invention provides a solution that automatically processes missed-report and false-report data, thereby improving the accuracy of data pre-judgment based on a machine learning model.

The embodiment of the invention provides data analysis equipment that can be applied to predicting data types in any scenario. For example, it can be applied to computer security detection, where the data analysis equipment checks the security of data so as to distinguish threatening data from non-threatening data in scenarios related to computer operation.

As shown in fig. 1, the data analysis apparatus may include: a processor 1001 and a memory 1002. The memory 1002 is connected to the processor 1001. The memory 1002 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 1002 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1002, which is a kind of computer storage medium, may include therein a data analysis program. In the apparatus shown in fig. 1, the processor 1001 may be configured to call a data analysis program stored in the memory 1002 to perform the step operations of any of the following embodiments of the data analysis method.

Further, the embodiment of the invention also provides a data analysis method that can be applied to predicting data types in any scenario. For example, it can be applied to computer security detection, where the method is used to check the security of data so as to distinguish threatening data from non-threatening data in scenarios related to computer operation.

In one embodiment, referring to fig. 2, the data analysis method includes:

step S10, analyzing the types of the plurality of first samples by adopting a first analysis model to obtain the prediction type of each first sample and obtain the actual type of each first sample;

The first analysis model is a machine learning model obtained by collecting a large amount of sample data of different types and training on it, and is used to predict the type of data. It may be any machine learning model, such as a deep neural network model. The parameters of the first analysis model comprise hyper-parameters and model parameters. The hyper-parameters are preset; a machine learning model with known hyper-parameters and unknown model parameters learns from the large number of collected training samples to determine its model parameters, and the resulting model, with both hyper-parameters and model parameters known, is used as the first analysis model.

A plurality of first samples of known actual type are input into the first analysis model, and the output of the first analysis model gives the prediction type of each first sample. The actual type here refers to the actual data type of the first sample. The number of first samples should be as large as possible.

The actual type of the first sample may be obtained from feedback information gathered while the first sample is in use, or from manually labeled information.

Step S20, determining an abnormal sample set according to the first sample with the difference between the prediction type and the actual type;

specifically, of the plurality of first samples, all the first samples having a difference between the prediction type and the actual type are used as the abnormal sample set.

Step S30, obtaining a sample at least comprising the abnormal sample set as a first sample set;

the abnormal sample set can be directly used as the first sample set, and all samples obtained by mixing the abnormal sample set with other specific sample sets can also be used as the first sample set. Wherein the actual type of each sample in the first set of samples is known. It should be noted that the amount of samples in the first set of samples is less than the total amount of the first samples.

Step S40, constructing a second analysis model by using the first sample set as a training sample;

A machine learning model with known hyper-parameters is trained on the first sample set, whose actual types are known, to determine the model parameters and obtain the second analysis model. The hyper-parameters of the second analysis model can be determined from those of the first analysis model: the hyper-parameters of the first analysis model may be used directly, or they may be corrected based on the characteristics of the abnormal sample set before being used.

And step S50, predicting the type of the sample to be tested according to the first analysis model and the second analysis model.

Specifically, a fusion strategy of a first analysis model and a second analysis model for analyzing a sample to be detected is pre-specified, an analysis model of the sample to be detected is determined based on the fusion strategy and the sample to be detected, and the type of the sample to be detected is predicted by using the determined analysis model.

In this embodiment, a first analysis model analyzes the types of a plurality of first samples to obtain the prediction type of each first sample, the actual type of each first sample is obtained, an abnormal sample set is determined from the first samples whose prediction type differs from their actual type, a second analysis model is constructed using a first sample set that at least comprises the abnormal sample set as training samples, and the type of the sample to be tested is predicted according to the first analysis model and the second analysis model. In this way, the abnormal samples in the prediction results of the first analysis model are used directly, as at least part of the training samples, to construct a new second analysis model, and the type of the sample to be tested is predicted by combining the first analysis model and the second analysis model. The missed reports and false reports of the first analysis model are thus handled without manual intervention, and what is learned from them is applied to the prediction of new samples to be tested, so that the accuracy of data prediction based on a machine learning model is ensured.
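Steps S10 to S40 can be illustrated with a minimal sketch. It assumes a toy nearest-centroid classifier as a stand-in for both analysis models (the source allows any machine learning model), two-dimensional features, and the types "black" and "white"; the function names `train_centroid_model` and `predict_type` and all data values are hypothetical.

```python
import math
from collections import defaultdict


def train_centroid_model(samples):
    """Stand-in training routine: store one mean feature vector per actual type."""
    groups = defaultdict(list)
    for features, actual_type in samples:
        groups[actual_type].append(features)
    return {
        actual_type: tuple(sum(column) / len(rows) for column in zip(*rows))
        for actual_type, rows in groups.items()
    }


def predict_type(model, features):
    """Return the type whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda t: math.dist(features, model[t]))


# First analysis model, trained on its own (toy) training set.
training_set = [((0, 0), "white"), ((0, 1), "white"),
                ((4, 4), "black"), ((4, 5), "black")]
first_model = train_centroid_model(training_set)

# S10: first samples with known actual types are analyzed by the first model.
first_samples = [
    ((0.5, 0.5), "white"),
    ((3.5, 4.0), "black"),
    ((1.5, 2.0), "black"),   # predicted "white": missed report
    ((1.0, 2.5), "black"),   # predicted "white": missed report
    ((3.0, 2.5), "white"),   # predicted "black": false report
]

# S20: samples whose prediction type differs from the actual type.
abnormal_set = [(x, t) for x, t in first_samples
                if predict_type(first_model, x) != t]

# S30: here the abnormal sample set is used directly as the first sample set.
first_sample_set = list(abnormal_set)

# S40: construct the second analysis model from the first sample set.
second_model = train_centroid_model(first_sample_set)

probe = (1.2, 2.2)
print(predict_type(first_model, probe), predict_type(second_model, probe))
# prints: white black
```

In this toy data the probe lies close to the samples that the first model got wrong, so the second model assigns it the type that the first model misses. How the two outputs are combined in step S50 depends on the fusion strategy, which this sketch leaves open.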

Further, based on the above embodiments, another embodiment of the data analysis method of the present application is provided. In this embodiment, referring to fig. 3, after step S10, the method further includes:

step S20a, determining a normal sample set according to the first sample with the prediction type matched with the actual type;

specifically, of the plurality of first samples, all the first samples having the same prediction type as the actual type are taken as a normal sample set.

The sequence of the steps S20 and S20a is not limited in particular.

Step S30 includes:

and step S30a, selecting a part of samples in the normal sample set to be mixed with the abnormal sample set to obtain a first sample set.

The way of selecting part of the samples in the normal sample set can be selected according to actual requirements.

In this embodiment, the second analysis model is trained on a sample set obtained by mixing part of the samples in the normal sample set with the abnormal sample set. The model is still built around the abnormal samples, but the overall number of training samples is increased, which improves the accuracy of the resulting second analysis model.

Specifically, to further improve the accuracy of the second analysis model in analyzing different types of data, step S30a includes step S31: based on a quantity balance rule, selecting part of the samples in the normal sample set to be mixed with the abnormal sample set to obtain the first sample set. The quantity balance rule is that the difference between the total numbers of samples corresponding to each actual type in the first sample set is less than or equal to a preset value. In other words, however many samples of each actual type the abnormal sample set contains, once the abnormal samples and the selected normal samples have been mixed by actual type, the total number of samples (abnormal plus normal) is approximately equal across the actual types in the resulting first sample set. The preset value can be set according to the actual prediction accuracy requirement. Preferably, the total numbers of samples corresponding to each actual type in the first sample set are equal, i.e. the difference between them is 0.

Based on this, step S31 may specifically include: and randomly extracting a part of samples in the normal sample set and mixing with the abnormal sample set based on the quantity balancing rule to obtain the first sample set. That is, the samples are randomly extracted from the normal sample set and mixed with the abnormal sample set until the difference between the total amount of the samples corresponding to each actual type in the first sample set is smaller than or equal to the preset value.
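The random extraction under the quantity balance rule might look like the following sketch. It assumes samples are stored as `(features, actual_type)` tuples and that the per-type target total defaults to the largest per-type count in the abnormal sample set; that default, the `seed` parameter, and the names `type_counts`, `satisfies_balance_rule` and `mix_randomly` are illustrative assumptions, not part of the source.

```python
import random
from collections import Counter, defaultdict


def type_counts(samples, types):
    """Total number of samples per actual type (0 for absent types)."""
    counts = Counter(actual_type for _, actual_type in samples)
    return {t: counts.get(t, 0) for t in types}


def satisfies_balance_rule(samples, types, preset_value):
    """Quantity balance rule: per-type totals differ by at most preset_value."""
    counts = type_counts(samples, types)
    return max(counts.values()) - min(counts.values()) <= preset_value


def mix_randomly(normal_set, abnormal_set, preset_value=0,
                 target_per_type=None, seed=0):
    """Randomly draw normal samples per type until each type reaches the target."""
    rng = random.Random(seed)
    types = sorted({t for _, t in normal_set + abnormal_set}, key=str)
    abnormal_counts = type_counts(abnormal_set, types)
    if target_per_type is None:
        # Assumed default: match the most frequent type in the abnormal set.
        target_per_type = max(abnormal_counts.values())

    normal_by_type = defaultdict(list)
    for sample in normal_set:
        normal_by_type[sample[1]].append(sample)

    first_sample_set = list(abnormal_set)
    for t in types:
        needed = max(0, target_per_type - abnormal_counts[t])
        pool = normal_by_type[t]
        first_sample_set.extend(rng.sample(pool, min(needed, len(pool))))

    if not satisfies_balance_rule(first_sample_set, types, preset_value):
        raise ValueError("quantity balance rule cannot be met with these samples")
    return first_sample_set


abnormal_set = [((0.9, 0.8), "black"), ((0.8, 0.9), "black"),
                ((0.7, 0.7), "black"), ((0.3, 0.2), "white")]
normal_set = ([((0.1 * i, 0.1), "white") for i in range(5)]
              + [((1.0, 0.1 * i), "black") for i in range(5)])

first_sample_set = mix_randomly(normal_set, abnormal_set, preset_value=0)
print(type_counts(first_sample_set, ["black", "white"]))
```

With three abnormal "black" samples and one abnormal "white" sample, two normal "white" samples are drawn so that both types end up with three samples each.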

In other embodiments, referring to fig. 4, step S31 may further specifically include:

step S311, defining each sample in the normal sample set as a first sample, defining a sample in the abnormal sample set, which is matched with an actual type of the first sample, as a second sample, and obtaining a first distance threshold corresponding to each actual type;

the first distance threshold is specifically a limit value of the euclidean distance between the extracted sample and the abnormal sample when the samples are extracted in the normal sample set based on the euclidean distance. The different actual types correspond to different first distance thresholds. The first distance threshold may be specifically determined according to the sample size of the abnormal sample set, the distribution characteristics of the samples in the normal sample set, and the like.

Step S312, in the first samples corresponding to each actual type, determining the first samples whose euclidean distances to each corresponding second sample are less than or equal to the first distance threshold as samples to be mixed;

The first samples with the same actual type are grouped into a second sample set, giving a plurality of second sample sets. A first sample is then taken from each second sample set; when the Euclidean distance between that first sample and each second sample of the same actual type is less than or equal to the first distance threshold, the first sample is highly similar to the samples of the corresponding type in the abnormal sample set and can therefore be used as a sample to be mixed.

And when the to-be-mixed samples corresponding to the actual types all meet the quantity balancing rule, executing step S313, and mixing the samples in the abnormal sample set with the corresponding to-be-mixed samples according to the actual types to obtain the first sample set.

Specifically, when the samples to be mixed all meet the quantity balance rule, all samples in the abnormal sample set with the same actual type and the corresponding samples to be mixed are mixed to obtain a first sample set.

In this embodiment, samples are extracted from the normal sample set and mixed with the abnormal sample set according to the steps S311 to S313, so that all the samples in the first sample set are samples with a high probability of false alarm and missing report, and therefore, the second analysis model obtained based on the training of the first sample set can accurately identify the samples with false alarm and missing report, and the accuracy of the prediction result can be further improved when the type of the sample to be detected is predicted based on the first analysis model and the second analysis model.
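The distance-based selection of steps S311 to S313 can be sketched as follows, assuming `(features, actual_type)` tuples and a dictionary that maps each actual type to its first distance threshold. The function name `samples_to_be_mixed` and all values are hypothetical, and the quantity balance check is omitted here for brevity.

```python
import math


def samples_to_be_mixed(normal_set, abnormal_set, first_thresholds):
    """S312: keep a normal sample only if it lies within the type's threshold
    of every abnormal sample that has the same actual type."""
    selected = {t: [] for t in first_thresholds}
    for features, actual_type in normal_set:
        if actual_type not in first_thresholds:
            continue
        second_samples = [f for f, t in abnormal_set if t == actual_type]
        if second_samples and all(
            math.dist(features, s) <= first_thresholds[actual_type]
            for s in second_samples
        ):
            selected[actual_type].append((features, actual_type))
    return selected


abnormal_set = [((1.0, 1.0), "black"), ((1.2, 1.0), "black"),
                ((0.0, 0.0), "white")]
normal_set = [((1.1, 1.0), "black"), ((3.0, 3.0), "black"),
              ((0.1, 0.0), "white"), ((2.0, 2.0), "white")]

# S311: one first distance threshold per actual type (values are assumptions).
first_thresholds = {"black": 0.5, "white": 0.5}

selected = samples_to_be_mixed(normal_set, abnormal_set, first_thresholds)

# S313: mix the abnormal samples with the selected normal samples by type.
first_sample_set = list(abnormal_set) + [
    sample for group in selected.values() for sample in group
]
print(selected)
```

Only the normal samples that sit next to the abnormal samples of their own type are kept, which is what concentrates the first sample set on the region where the first model tends to err.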

Further, referring to fig. 4, before step S313, the method further includes:

step S31a, judging whether each sample to be mixed meets the quantity balance rule;

if yes, go to step S313; if not, the process returns to step S312 after steps S314 and S315 are executed.

Step S314, determining the actual type corresponding to the sample to be mixed which does not meet the quantity balance rule as the target type;

step S315, adjusting a first distance threshold corresponding to the target type.

The first distance threshold may be adjusted by increasing or decreasing it by a predetermined amount. In addition, to ensure the accuracy of the first distance threshold and improve the efficiency of generating the first sample set, the first distance threshold may be determined according to the difference between the total sample amount corresponding to the target type and the total sample amount that satisfies the quantity balance rule.

Here, when a certain type of sample to be mixed does not satisfy the quantity balance rule, the first distance threshold is adjusted, so that the situation that the balance of various types of samples in the first sample set is influenced due to unreasonable setting of the first distance threshold is avoided, the accuracy of the second analysis model is effectively guaranteed, and the prediction precision of the sample to be measured is further improved.
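The loop formed by steps S312, S31a, S314 and S315 could be sketched as below. The source does not say how a single type is checked against the quantity balance rule, so this sketch assumes a hypothetical `target_total` per type and treats a type as satisfying the rule when its total is within `preset_value` of that target. The fixed `step` and the `max_rounds` guard are also assumptions.

```python
import math


def select_for_type(normal_set, abnormal_set, actual_type, threshold):
    """S312 for one type: normal samples within threshold of every second sample."""
    second_samples = [f for f, t in abnormal_set if t == actual_type]
    if not second_samples:
        return []
    return [(f, t) for f, t in normal_set
            if t == actual_type
            and all(math.dist(f, s) <= threshold for s in second_samples)]


def tune_first_thresholds(normal_set, abnormal_set, thresholds, target_total,
                          preset_value=0, step=0.1, max_rounds=100):
    """Adjust per-type thresholds until every type meets the (assumed) rule."""
    thresholds = dict(thresholds)
    for _ in range(max_rounds):
        # S31a / S314: find the target types that do not meet the rule.
        target_types = []
        for actual_type, threshold in thresholds.items():
            n_abnormal = sum(1 for _, t in abnormal_set if t == actual_type)
            n_mixed = len(select_for_type(normal_set, abnormal_set,
                                          actual_type, threshold))
            total = n_abnormal + n_mixed
            if abs(total - target_total) > preset_value:
                target_types.append((actual_type, total))
        if not target_types:
            return thresholds
        # S315: widen the threshold when too few samples, narrow when too many.
        for actual_type, total in target_types:
            thresholds[actual_type] += step if total < target_total else -step
    raise RuntimeError("thresholds did not converge within max_rounds")


abnormal_set = [((0.0, 0.0), "black"), ((10.0, 0.0), "white")]
normal_set = [((0.05, 0.0), "black"), ((0.25, 0.0), "black"),
              ((0.45, 0.0), "black"), ((0.65, 0.0), "black"),
              ((10.05, 0.0), "white"), ((10.15, 0.0), "white"),
              ((10.25, 0.0), "white"), ((10.9, 0.0), "white")]

tuned = tune_first_thresholds(normal_set, abnormal_set,
                              {"black": 0.1, "white": 0.1}, target_total=3)
print(tuned)
```

A fixed step can overshoot and oscillate, which is why the sketch caps the number of rounds. Deriving the adjustment from the gap between the current total and the target total, as the text above suggests, would be one way to converge faster.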

Further, based on any of the above embodiments, another embodiment of the data analysis method of the present application is provided. In the present embodiment, referring to fig. 5, step S50 includes:

step S51, fusing the first analysis model and the second analysis model to obtain a third analysis model;

and fusing the first analysis model and the second analysis model according to a model fusion strategy to obtain a third analysis model. The model fusion strategy can be selected according to actual requirements.

Specifically, step S51 includes:

step S511, obtaining a training sample set corresponding to the first analysis model as a third sample set;

the third sample set is specifically a training sample for constructing the first analysis model.

Step S512, mixing the first sample set and the third sample set to obtain a fourth sample set;

step S513, performing feature extraction on the fourth sample set by using the first analysis model to obtain a first feature parameter, and performing feature extraction on the fourth sample set by using the second analysis model to obtain a second feature parameter;

The first characteristic parameter and the second characteristic parameter are both the probability values, obtained after the corresponding analysis model analyzes the samples in the fourth sample set, that each sample belongs to each preset type. For example, when the preset types include a black sample and a white sample, after the first analysis model analyzes one sample in the fourth sample set, the probability that the sample is a black sample and the probability that it is a white sample can be obtained as the first characteristic parameter; after the second analysis model analyzes one sample in the fourth sample set, the probability that the sample is a black sample and the probability that it is a white sample can be obtained as the second characteristic parameter.

Step S514, using the first characteristic parameter, the second characteristic parameter and the actual type corresponding to each sample in the fourth sample set as characteristic parameters;

and step S515, constructing the third analysis model by using the characteristic parameters as training samples.

Step S52, analyzing the sample to be tested by using the first analysis model, the second analysis model, and the third analysis model, and determining the type of the sample to be tested.

Specifically, the sample to be tested can be analyzed by respectively adopting the first analysis model, the second analysis model and the third analysis model, and comprehensive analysis is performed based on analysis results respectively obtained by the three models to obtain the type of the sample to be tested.

In addition, the analysis result of any one or two analysis models can be used as the input data of the rest analysis models, and the output result of the rest analysis models can be used as the type of the sample to be detected. Specifically, based on the steps S511 to S515, the step S52 may specifically include: performing feature extraction on the sample to be detected by adopting the first analysis model to obtain a third feature parameter, and performing feature extraction on the sample to be detected by adopting the second analysis model to obtain a fourth feature parameter; and inputting the third characteristic parameter and the fourth characteristic parameter into the third analysis model, and taking a result output by the third analysis model as the type of the sample to be detected. The extraction manner and concept of the third feature parameter and the fourth feature parameter are similar to those of the first feature parameter and the second feature parameter, and are not described herein again.

In this embodiment, a third analysis model is generated based on the first analysis model and the second analysis model, and the three analysis models are combined to analyze the sample to be detected, so that the accuracy of the prediction result is improved. When the third analysis model is generated, the first analysis model and the second analysis model are respectively adopted to extract features from the combined training sample set, and the extracted features are combined with the actual types of the samples to train the third analysis model, so that the third analysis model can accurately identify the possibility of false alarms and missed alarms for samples of the different types. On this basis, when the type of the sample to be detected is predicted, the first analysis model and the second analysis model are respectively adopted to perform feature extraction on the sample to be detected, and the extracted features are then input into the third analysis model for type prediction, so that the accuracy of the obtained type of the sample to be detected is ensured.

Further, based on any of the above embodiments, a further embodiment of the data analysis method of the present application is provided. Referring to fig. 6, step S50 includes:

Step S501, judging whether the Euclidean distances between the sample to be detected and each sample in the abnormal sample set are all smaller than or equal to the second distance threshold; if yes, go to step S502; if not, go to step S503.

The second distance threshold may be specifically determined according to the actual sample size of each type of sample in the abnormal sample set, the characteristic condition of each sample, the prediction accuracy requirement, and the like.

When the Euclidean distances between the sample to be detected and each sample in the abnormal sample set are all smaller than or equal to the second distance threshold, the probability of a missed alarm or false alarm when the first analysis model analyzes the sample is high, and therefore the type of the sample to be detected is predicted by adopting the second analysis model. When the Euclidean distances between the sample to be detected and the samples in the abnormal sample set are not all smaller than or equal to the second distance threshold, the probability of a missed alarm or false alarm when the first analysis model analyzes the sample is not high, and therefore the accuracy of the type of the sample to be detected can be ensured by adopting the first analysis model. To further improve the accuracy of the analysis, the number of samples in the abnormal sample set whose Euclidean distance from the sample to be detected is smaller than or equal to the second distance threshold may also be taken into account: when the distances are not all smaller than or equal to the second distance threshold and that number is small, the first analysis model is adopted to analyze the sample to be detected; otherwise, the second analysis model is adopted to analyze the sample to be detected.

Step S502, predicting the type of the sample to be detected by adopting the second analysis model;

Step S503, predicting the type of the sample to be detected by adopting the first analysis model.

In this embodiment, the possibility of a missed alarm or false alarm when the first analysis model analyzes the sample to be detected is assessed based on the Euclidean distance. When the possibility is high, the type of the sample to be detected is predicted by adopting the second analysis model; otherwise, the type of the sample to be detected is predicted by adopting the first analysis model. In this way, missed alarms and false alarms in predicting the type of the sample to be detected are effectively reduced, and the accuracy of predicting the type of the sample to be detected based on the machine learning model is improved.
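Steps S501 to S503 can be sketched as a simple routing function. This is a minimal illustration, assuming samples are numeric feature vectors of equal length and the models are hypothetical callables; the names `choose_model` and `predict_with_routing` and the toy values are not from the text. Treating an empty abnormal sample set as a case for the first analysis model is also an assumption.

```python
import math


def choose_model(sample, abnormal_set, second_distance_threshold):
    """Step S501: return "second" when every sample in the abnormal sample
    set lies within the threshold of the sample to be detected, otherwise
    return "first". An empty abnormal set falls back to "first" (assumption)."""
    all_close = bool(abnormal_set) and all(
        math.dist(sample, abnormal) <= second_distance_threshold
        for abnormal in abnormal_set
    )
    return "second" if all_close else "first"


def predict_with_routing(sample, abnormal_set, second_distance_threshold,
                         first_model, second_model):
    """Steps S502 and S503: predict with the model selected in step S501."""
    choice = choose_model(sample, abnormal_set, second_distance_threshold)
    model = second_model if choice == "second" else first_model
    return model(sample)


# Toy data: two abnormal samples and a threshold of 1.5.
abnormal_set = [(0.0, 0.0), (1.0, 0.0)]
threshold = 1.5

near_choice = choose_model((0.5, 0.0), abnormal_set, threshold)  # distances 0.5 and 0.5
far_choice = choose_model((3.0, 0.0), abnormal_set, threshold)   # distances 3.0 and 2.0
mixed_choice = choose_model((2.0, 0.0), abnormal_set, threshold) # distances 2.0 and 1.0
```

Here only the first test point is within the threshold of every abnormal sample, so only it is routed to the second analysis model.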

Furthermore, the present invention also provides a data analysis apparatus including:

In this embodiment, specific steps executed by various modules may refer to the step operations corresponding to any embodiment of the data analysis method, which are not described herein again.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a data analysis program is stored on the computer-readable storage medium, and when the data analysis program is executed by a processor, the data analysis program implements the step operations of any embodiment of the data analysis method.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A data analysis method, characterized in that the data analysis method comprises the steps of:

2. The data analysis method of claim 1, wherein the step of analyzing the types of the plurality of first samples by using the first analysis model to obtain the predicted type of each first sample, and obtaining the actual type of each first sample further comprises:

3. The data analysis method of claim 2, wherein the step of selecting a portion of the samples from the normal sample set to be mixed with the abnormal sample set to obtain the first sample set comprises:

4. The data analysis method of claim 3, wherein the step of selecting a portion of the samples in the normal sample set to be mixed with the abnormal sample set based on the quantitative balance rule to obtain the first sample set comprises:

5. The data analysis method of claim 3, wherein the step of selecting a portion of the samples in the normal sample set to be mixed with the abnormal sample set based on the quantitative balance rule to obtain the first sample set comprises:

6. The data analysis method of claim 5, wherein, in the first samples corresponding to each of the actual types, the step of determining, as the samples to be mixed, the first samples having the Euclidean distances from each corresponding second sample smaller than or equal to the first distance threshold further comprises:

adjusting a first distance threshold corresponding to the target type;

7. The data analysis method of any one of claims 1 to 6, wherein the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model comprises:

8. The data analysis method of claim 7, wherein the step of fusing the first analytical model and the second analytical model to obtain a third analytical model comprises:

9. The data analysis method of claim 8, wherein the step of analyzing the sample to be tested by using the first analysis model, the second analysis model and the third analysis model to determine the type of the sample to be tested comprises:

10. The data analysis method of any one of claims 1 to 6, wherein the step of predicting the type of the sample to be tested according to the first analysis model and the second analysis model comprises:

11. A data analysis apparatus, characterized in that the data analysis apparatus comprises:

12. A data analysis device, characterized in that the data analysis device comprises: a memory, a processor and a data analysis program stored on the memory and executable on the processor, the data analysis program when executed by the processor implementing the steps of the data analysis method according to any one of claims 1 to 10.

13. A readable storage medium, characterized in that the readable storage medium has stored thereon a data analysis program which, when executed by a processor, implements the steps of the data analysis method according to any one of claims 1 to 10.