CN113283512A - Data anomaly detection method, device, equipment and storage medium - Google Patents

Data anomaly detection method, device, equipment and storage medium

Info

Publication number
CN113283512A
Authority
CN
China
Prior art keywords
data
abnormal
factor
exception
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110599503.5A
Other languages
Chinese (zh)
Inventor
秦婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kangjian Information Technology Shenzhen Co Ltd
Original Assignee
Kangjian Information Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kangjian Information Technology Shenzhen Co Ltd filed Critical Kangjian Information Technology Shenzhen Co Ltd
Priority to CN202110599503.5A priority Critical patent/CN113283512A/en
Publication of CN113283512A publication Critical patent/CN113283512A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention relates to the field of artificial intelligence, and discloses a data anomaly detection method, a device, equipment and a storage medium, wherein the method comprises the following steps: preprocessing historical data to obtain sample data and a data exception type corresponding to the sample data, performing exception analysis on the sample data, and extracting a first exception factor corresponding to the data exception type from the sample data; calculating a linear correlation value of the data exception type and the first exception factor, and screening the first exception factor to obtain a second exception factor; training a preset detection tool by using the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model; and calling the abnormality detection model to perform abnormality detection on the data to be detected. According to the invention, an anomaly detection model is constructed, anomaly detection on product operation data is realized, the cause and type of data anomaly can be positioned, and the accuracy and efficiency of data anomaly detection and data anomaly analysis are improved.

Description

Data anomaly detection method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a data anomaly detection method, a data anomaly detection device, data anomaly detection equipment and a storage medium.
Background
With the rapid development of internet technology and the growth of traffic, the number of internet users has increased rapidly. Clients have accordingly gained more and more functions and background support systems, and more data anomalies are generated.
At present, most companies on the market do not have real-time data dashboards, and maintaining real-time data is costly, so when abnormal fluctuation occurs it is difficult to quickly locate the abnormal point and the time at which the abnormality occurred. Moreover, a service monitoring system usually only provides various operation monitoring data and cannot detect abnormal data, so data anomalies are analyzed inefficiently and the cause of the abnormal data cannot be determined. How to detect abnormal data and improve the efficiency of data anomaly analysis is therefore a problem that urgently needs to be solved.
Disclosure of Invention
The main purpose of the invention is to solve the technical problem that the prior art cannot detect abnormal data, which makes data anomaly analysis inefficient.
The first aspect of the present invention provides a data anomaly detection method, including: acquiring historical data in each application program, and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data, wherein the historical data is exception data generated in the product operation management process; determining a data analysis rule based on the data exception type, performing exception analysis on the sample data by using the data analysis rule, and extracting at least two first exception factors; calculating linear correlation values of the data exception type and each first exception factor, and screening the first exception factors based on the linear correlation values to obtain second exception factors; taking the second abnormal factor and the sample data as training corpora, and training a preset detection tool to obtain an abnormal detection model; calling the anomaly detection model to perform anomaly detection on data to be detected, and judging whether the data to be detected is anomalous data or not based on the result of the anomaly detection, wherein the data to be detected is product operation data.
Optionally, in a first implementation manner of the first aspect of the present invention, the obtaining historical data in each application program and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data includes: acquiring historical data in each application program, and identifying data attributes corresponding to each data in the historical data based on preset data attribute categories; judging whether the data attribute corresponding to each data in the historical data is a numerical value attribute; if so, removing the data belonging to the numerical value attribute; collecting the history data subjected to the elimination processing to form sample data; and extracting an abnormal type identifier carried by each data in the sample data, and determining a data abnormal type corresponding to each data in the sample data according to the abnormal type identifier.
Optionally, in a second implementation manner of the first aspect of the present invention, the calculating a linear correlation value between the data exception type and the first exception factor, and screening the first exception factor based on the linear correlation value to obtain a second exception factor includes: extracting all factor features which are associated with the data exception type in each first exception factor; calculating a linear correlation value between the data anomaly type and each factor feature; comparing the linear correlation value with a preset correlation threshold value; and if the linear correlation value is smaller than a preset correlation threshold value, removing the corresponding first abnormal factor from at least two first abnormal factors to obtain a second abnormal factor.
Optionally, in a third implementation manner of the first aspect of the present invention, the training a preset detection tool with the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model includes: dividing each data in the sample data according to a preset division rule to respectively form a verification data set and a training data set; taking the second abnormal factor and the training data set as training corpora, and training a preset detection tool to obtain a preliminary model; and adjusting parameters of the preliminary model based on the verification data set to obtain an abnormal detection model.
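The division into a training data set and a verification data set can be illustrated with a minimal sketch. The text does not specify the preset division rule, so the random shuffle, the `validation_ratio` of 0.2 and the function name `split_samples` are all assumptions made for illustration only.

```python
import random

def split_samples(samples, validation_ratio=0.2, seed=0):
    """Split samples into (training set, verification set).

    The shuffle-and-cut rule and the 0.2 ratio are hypothetical; the text
    only says that a preset division rule is applied.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_ratio))
    return shuffled[n_validation:], shuffled[:n_validation]

# Ten placeholder sample records, identified here only by an integer.
training_set, validation_set = split_samples(range(10), validation_ratio=0.2)
```

Every sample lands in exactly one of the two sets, so the verification data set is never seen during training.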
Optionally, in a fourth implementation manner of the first aspect of the present invention, the training a preset detection tool with the second abnormal factor and the training data set as training corpora to obtain a preliminary model includes: classifying the training data set based on a preset bootstrap (self-help) algorithm to obtain a classification result; ranking the second abnormal factors by importance based on the classification result to obtain an abnormal factor sequence; screening the second abnormal factors according to the abnormal factor sequence to obtain a third abnormal factor; and taking the third abnormal factor and the training data set as training corpora, and training a preset detection tool to obtain a preliminary model.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the classifying the training data set based on a preset bootstrap algorithm to obtain a classification result includes: based on the preset bootstrap algorithm, sampling the training data set with replacement to obtain at least one sample, and performing sample expansion processing on the at least one sample to obtain a plurality of bootstrap sample sets; constructing a plurality of classification trees from the bootstrap sample sets; and collecting the classification trees into a random forest, and calling a preset random forest classifier to classify the random forest to obtain a classification result.
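The self-help (bootstrap) sampling and tree-voting idea in this implementation manner can be sketched as follows. This is a toy illustration under stated assumptions, not the patented classifier: each "classification tree" is reduced to a single-split stump on one feature, the sample expansion processing is not modelled separately, and the names `bootstrap_sets`, `fit_stump` and `forest_predict` are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sets(training_set, n_sets, seed=0):
    """Draw n_sets samples of the same size as training_set, with replacement."""
    rng = random.Random(seed)
    size = len(training_set)
    return [rng.choices(training_set, k=size) for _ in range(n_sets)]

def fit_stump(sample_set):
    """Toy one-split tree: threshold at the midpoint between the class means."""
    ones = [x for x, y in sample_set if y == 1]
    zeros = [x for x, y in sample_set if y == 0]
    if not ones or not zeros:
        constant = 1 if ones else 0  # only one class was drawn
        return lambda x: constant
    mean_one = sum(ones) / len(ones)
    mean_zero = sum(zeros) / len(zeros)
    threshold = (mean_one + mean_zero) / 2
    high_label = 1 if mean_one >= mean_zero else 0
    return lambda x: high_label if x > threshold else 1 - high_label

def forest_predict(trees, x):
    """Majority vote over all trees in the forest."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# (feature value, label) pairs; label 1 marks an abnormal record (made-up data).
training_set = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
                (2.0, 1), (2.2, 1), (2.4, 1), (2.6, 1)]
sets = bootstrap_sets(training_set, n_sets=25)
trees = [fit_stump(s) for s in sets]
```

Because each tree sees a different resample, individual trees may disagree, and the vote in `forest_predict` gives the forest-level classification result.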
Optionally, in a sixth implementation manner of the first aspect of the present invention, the adjusting the parameters of the preliminary model based on the verification dataset to obtain an anomaly detection model includes: inputting the verification data set into the preliminary model, and outputting a detection result; evaluating the detection result based on the data abnormity type in the verification data set, and judging whether the evaluated result meets a preset standard or not; and if the evaluation result does not meet the preset standard, adjusting the parameters of the preliminary model according to a preset parameter adjusting rule to obtain an abnormal detection model.
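The evaluate-then-adjust loop described here can be sketched under simple assumptions. The text does not give the preset standard or the preset parameter adjusting rule; in this illustration the standard is a minimum accuracy, the only model parameter is a single decision threshold, and the adjusting rule is to try the next value in a candidate list. All names are hypothetical.

```python
def accuracy(model, validation_set):
    """Fraction of verification records whose predicted type matches the label."""
    correct = sum(1 for value, label in validation_set if model(value) == label)
    return correct / len(validation_set)

def tune_threshold(validation_set, candidates, standard):
    """Try candidate thresholds in order until the accuracy meets the standard.

    Returns (threshold, accuracy) for the best candidate seen so far.
    """
    best = None
    for threshold in candidates:
        model = lambda value, t=threshold: "abnormal" if value > t else "normal"
        score = accuracy(model, validation_set)
        if best is None or score > best[1]:
            best = (threshold, score)
        if score >= standard:
            break  # the preset standard is met, stop adjusting
    return best

# Made-up verification records: (feature value, labelled type).
validation_set = [(0.2, "normal"), (0.9, "normal"),
                  (1.4, "abnormal"), (3.0, "abnormal")]
best = tune_threshold(validation_set, candidates=[2.0, 1.0, 0.5], standard=1.0)
```

With these made-up records the first candidate misclassifies one record, so the parameter is adjusted once before the standard is met.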
A second aspect of the present invention provides a data abnormality detection apparatus, including: the preprocessing module is used for acquiring historical data in each application program and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data; the extraction module is used for determining a data analysis rule based on the data exception type, carrying out exception analysis on the sample data by using the data analysis rule and extracting at least two first exception factors; the calculation module is used for calculating linear correlation values of the data exception types and the first exception factors, and screening the first exception factors based on the linear correlation values to obtain second exception factors; the training module is used for training a preset detection tool by taking the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model; and the detection module is used for calling the abnormality detection model to perform abnormality detection on the data to be detected and judging whether the data to be detected is abnormal data or not based on the result of the abnormality detection.
Optionally, in a first implementation manner of the second aspect of the present invention, the preprocessing module is specifically configured to: identifying data attributes corresponding to all data in historical data based on preset data attribute categories; judging whether the data attribute corresponding to each data in the historical data is a numerical value attribute; if the data attribute corresponding to a piece of data is a numerical value attribute, removing the data belonging to the numerical value attribute; collecting the history data subjected to the elimination processing to form sample data; and extracting an abnormal type identifier carried by each data in the sample data, and determining a data abnormal type corresponding to each data in the sample data according to the abnormal type identifier.
Optionally, in a second implementation manner of the second aspect of the present invention, the calculation module is specifically configured to: extracting all factor features which are associated with the data exception type in each first exception factor; calculating a linear correlation value between the data anomaly type and each factor feature; comparing the linear correlation value with a preset correlation threshold value; and if the linear correlation value is smaller than the preset correlation threshold value, removing the corresponding first abnormal factor from at least two first abnormal factors to obtain a second abnormal factor.
Optionally, in a third implementation manner of the second aspect of the present invention, the training module includes: the dividing unit is used for dividing each data in the sample data according to a preset dividing rule to respectively form a verification data set and a training data set; the training unit is used for training a preset detection tool by taking the second abnormal factor and the training data set as training corpora to obtain a preliminary model; and the adjusting unit is used for adjusting the parameters of the preliminary model based on the verification data set to obtain an abnormal detection model.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the training unit includes: the classification subunit is used for classifying the training data set based on a preset bootstrap (self-help) algorithm to obtain a classification result; the sorting subunit is configured to rank the second abnormal factors by importance based on the classification result to obtain an abnormal factor sequence; the screening subunit is used for screening the second abnormal factors according to the abnormal factor sequence to obtain a third abnormal factor; and the training subunit is used for training a preset detection tool by taking the third abnormal factor and the training data set as training corpora to obtain a preliminary model.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the classification subunit is specifically configured to: based on the preset bootstrap algorithm, sampling the training data set with replacement to obtain at least one sample, and performing sample expansion processing on the at least one sample to obtain a plurality of bootstrap sample sets; constructing a plurality of classification trees from the bootstrap sample sets; and collecting the classification trees into a random forest, and calling a preset random forest classifier to classify the random forest to obtain a classification result.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the adjusting unit is specifically configured to: inputting the verification data set into the preliminary model, and outputting a detection result; evaluating the detection result based on the data abnormity type in the verification data set, and judging whether the evaluated result meets a preset standard or not; and if the evaluation result does not meet the preset standard, adjusting the parameters of the preliminary model according to a preset parameter adjusting rule to obtain an abnormal detection model.
A third aspect of the present invention provides a data anomaly detection device including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invokes the instructions in the memory to cause the data anomaly detection device to perform the steps of the data anomaly detection method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon instructions which, when run on a computer, cause the computer to perform the steps of the data anomaly detection method described above.
In the technical scheme provided by the invention, sample data and a data exception type corresponding to each sample data are obtained by preprocessing historical data in each application program, and a first exception factor corresponding to the data exception type is extracted from the sample data; calculating a linear correlation value of the data exception type and the first exception factor, and screening the first exception factor to obtain a second exception factor; training a preset detection tool by using the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model; and calling the abnormality detection model to perform abnormality detection on the data to be detected. According to the invention, an anomaly detection model is constructed, the detection of data anomaly is realized, the cause and the type of the data anomaly can be positioned, and the accuracy and the efficiency of data anomaly detection and data anomaly analysis are improved.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a data anomaly detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of a data anomaly detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of a data anomaly detection method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of a data anomaly detection method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a data anomaly detection apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of a data anomaly detection device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of a data anomaly detection device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data anomaly detection method, a data anomaly detection device, data anomaly detection equipment and a storage medium, wherein anomaly analysis is carried out on historical data in each application program, an anomaly factor is extracted, an anomaly detection model is constructed, and anomaly detection is carried out on each product operation data. The technical scheme of the embodiment of the invention can carry out anomaly detection on the product operation data and improve the efficiency and accuracy of data anomaly detection and data anomaly analysis.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to FIG. 1, a first embodiment of a data anomaly detection method according to an embodiment of the present invention includes:
101, acquiring historical data in each application program, and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data;
Through a software development kit, abnormal product operation data generated during product operation management is collected in real time from an application program (APP) on a mobile terminal and sent to an abnormal data analysis platform. The abnormal product operation data may include, but is not limited to, intra-two product ordering relationship data, additional product ordering information table data, product service track table data, product service instance table data, and external relationship table data. Specifically, the abnormal data analysis platform uses its built-in receiving program to receive, in real time, the abnormal product operation data uploaded by the APP on the mobile terminal used by at least one user.
In this embodiment, preprocessing consists of identifying the data attribute corresponding to each piece of data in the historical data and then screening the historical data according to those attributes to obtain the sample data.
Specifically, each piece of data in the historical data carries feature information of its data attribute category. This feature information is extracted, and the data attribute of the data is judged from the feature information and the preset data attribute categories, so that the data attribute corresponding to each piece of data in the historical data is identified. When the data attribute corresponding to a piece of data is a numerical attribute, the data is removed, and the historical data that remains after attribute identification and removal is taken as the sample data.
Furthermore, obtaining the sample data through preprocessing also includes determining the data exception type corresponding to each piece of data in the sample data. Specifically, the identifier carried in the abnormal data uploaded by the mobile terminal is recognized, and the data exception type of the uploaded abnormal data is looked up according to the correspondence between identifiers and data exception types. The identifier is set by the mobile terminal when it uploads the abnormal data to the abnormal data analysis platform. For example, when the mobile terminal uploads abnormal data, the data directly carries an identifier of the preset exception type to which it belongs, and when the abnormal data analysis platform receives the data, it can identify and confirm that preset exception type directly from the identifier. Specifically, an analysis program carried by the abnormal data analysis platform analyzes the exception type of the received abnormal data, and the analysis result is the preset exception type corresponding to the abnormal data. The analysis program may be, but is not limited to, a big-data-stream type calculation typing program.
In addition, the data exception type corresponding to each piece of data in the sample data can also be determined by hypothesis and verification: the most probable cause is identified from past experience and the available information and taken as a hypothesis about the cause of the anomaly, and the hypothesis is then verified by splitting each index in the sample data and performing multi-dimensional analysis to locate the problem. During this process, a new hypothesis may be built on the original one, or the original one may be adjusted, until the cause is located and the data exception type is obtained.
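The preprocessing of step 101 (removing numerical-attribute data, then reading the exception type from the identifier carried by each record) can be sketched as follows. The record layout, the attribute labels and the identifier values `E01` and `E02` are hypothetical; the text does not specify them.

```python
from typing import Any

# Hypothetical correspondence between the identifier carried by a record and
# a preset data exception type; the real identifiers are not given in the text.
TYPE_BY_IDENTIFIER = {
    "E01": "order_relation_error",
    "E02": "service_track_error",
}

def preprocess(history: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop numerical-attribute records and label the remaining ones."""
    samples = []
    for record in history:
        if record["attribute"] == "numerical":
            continue  # data with a numerical attribute is removed
        labelled = dict(record)
        labelled["exception_type"] = TYPE_BY_IDENTIFIER.get(
            record["identifier"], "unknown"
        )
        samples.append(labelled)
    return samples

history = [
    {"id": 1, "attribute": "text", "identifier": "E01"},
    {"id": 2, "attribute": "numerical", "identifier": "E02"},
    {"id": 3, "attribute": "text", "identifier": "E02"},
]
samples = preprocess(history)
```

The returned records form the sample data, each paired with its data exception type.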
102, determining a data analysis rule based on the data exception type, performing exception analysis on sample data by using the data analysis rule, and extracting at least two first exception factors;
When data contains an abnormal factor, an anomaly of the corresponding preset data exception type may occur in the data. Extracting the abnormal factors means analyzing, based on the data exception type determined for each piece of sample data, the relationship between the sample data and its data exception type, and extracting from the sample data the abnormal factors that cause the anomaly of each data exception type. There are at least two abnormal factors, and one data exception type corresponds to at least one abnormal factor; that is, several different abnormal factors may produce the same abnormal effect in the data. Specifically, the relationship between each piece of sample data and its data exception type determines which abnormal factor makes the sample data abnormal, and this relationship can be regarded as a data analysis rule; the abnormal factor corresponding to the sample data is determined by performing exception analysis on the sample data according to the data analysis rule. Further, the exception analysis and the extraction of the first abnormal factor may be performed by feature extraction on each piece of sample data: abnormal feature values that reflect the data exception type are extracted from the sample data, and the abnormal factor corresponding to each data exception type is then calculated from these feature values and taken as the first abnormal factor.
103, calculating linear correlation values of the data exception types and the first exception factors, and screening the first exception factors based on the linear correlation values to obtain second exception factors;
Linear correlation analysis is performed on the data exception type and the corresponding first abnormal factor; that is, a linear correlation value between them is calculated. The linear correlation value is then compared with a preset correlation threshold. A linear correlation value smaller than the preset correlation threshold indicates that the correlation between the data exception type and the corresponding abnormal factor is weak; a value not smaller than the threshold indicates that the correlation is strong. Based on this comparison, the abnormal factors that correlate strongly with the data exception type are selected from the first abnormal factors and taken as the second abnormal factors.
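The screening in step 103 can be illustrated with a short sketch. The text does not name the correlation measure, so the Pearson coefficient used here is an assumption, as are the 0/1 encoding of the data exception type, the use of the absolute value, the threshold of 0.5 and the factor names.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear relationship
    return cov / math.sqrt(var_x * var_y)

def screen_factors(type_indicator, first_factors, threshold):
    """Keep the factors whose |correlation| is not smaller than the threshold."""
    second_factors = {}
    for name, values in first_factors.items():
        r = abs(pearson(type_indicator, values))
        if r >= threshold:
            second_factors[name] = r
    return second_factors

# 1 marks samples of the exception type under study (assumed encoding).
type_indicator = [1, 1, 0, 0, 1, 0]
first_factors = {
    "retry_count": [5, 4, 1, 0, 6, 1],   # hypothetical factor feature
    "payload_size": [3, 1, 2, 3, 2, 1],  # hypothetical factor feature
}
second_factors = screen_factors(type_indicator, first_factors, threshold=0.5)
```

In this made-up data `retry_count` tracks the exception type closely and is kept, while `payload_size` has zero linear correlation with it and is removed.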
104, training a preset detection tool by taking the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model;
The second abnormal factor and the sample data are input into a preset detection tool as training corpora, and the detection tool is trained. Specifically, the sample data is input into the preset detection tool, and the detection tool identifies the abnormal factors in the sample data to obtain an identification result. The abnormal factor in the identification result is compared with the second abnormal factor to check whether the identification result is accurate; when the two are consistent, the identification result is accurate. The detection tool is trained and verified repeatedly in this way to obtain an anomaly detection model with higher accuracy.
And 105, calling an abnormality detection model to perform abnormality detection on the data to be detected, and judging whether the data to be detected is abnormal data or not based on the result of the abnormality detection.
The generated anomaly detection model is called to perform anomaly detection on the data to be detected. In this process, the data to be detected is input into the anomaly detection model, and the model performs anomaly detection according to preset detection parameters, mainly judging whether an abnormal factor exists in the data to be detected. If an abnormal factor exists, the data to be detected is abnormal data, and the data exception type is determined from the abnormal factor. The anomaly detection model then outputs a detection result. The detection result includes the judgment of whether the data to be detected is abnormal data and, when it is, the corresponding data exception type, which indicates the cause of the data anomaly.
In addition, after abnormal data is detected, a data monitoring system can be established. The data monitoring system reflects the overall operating condition and the service targets in the product operation management process, provides a reference against which current operation data can be compared, monitors whether the operation data of each product is abnormal, promptly finds rises or falls in the service indexes of product operation management and their causes, reflects the possible future trend of the product service line, and supports cost control and the like based on the index data. To establish the data monitoring system, the product service target, KPIs and product stage are first defined. According to the current product service target, the data indexes are graded and split into basic indexes and calculated indexes, an index logic tree is established, and the influence of each index's fluctuation on the abnormal index is calculated with a cascade iteration analysis method. The cascade iteration analysis method splits indexes into basic indexes to generate a waterfall analysis chart and finds, from a user-defined ordered index list, the key indexes that cause the index to change. Data index monitoring reports are then built in units of days, weeks and months, and the management flow is clarified and control is implemented according to the data monitoring results.
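The idea of splitting an index into basic indexes and measuring how much each one contributes to a fluctuation can be sketched for the simplest case. The sketch assumes a purely additive index (the monitored index is the sum of its basic indexes), which the text does not state, and the index names and values are invented for illustration.

```python
def fluctuation_contributions(previous, current):
    """Return {basic index: (absolute change, share of the total change)}.

    Assumes the monitored index is the plain sum of its basic indexes.
    """
    total_change = sum(current.values()) - sum(previous.values())
    contributions = {}
    for name in current:
        change = current[name] - previous[name]
        share = change / total_change if total_change else 0.0
        contributions[name] = (change, share)
    return contributions

# Hypothetical basic indexes of a "total orders" index in two periods.
previous = {"app_orders": 500, "web_orders": 300, "partner_orders": 200}
current = {"app_orders": 380, "web_orders": 310, "partner_orders": 190}

contributions = fluctuation_contributions(previous, current)
# The key index is the basic index with the largest absolute change.
key_index = max(contributions, key=lambda name: abs(contributions[name][0]))
```

The per-index changes are the bars of a waterfall chart: taken in order, they bridge the previous total to the current total.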
After the data monitoring system is established, it can periodically monitor the fluctuation of abnormal data and judge which fluctuation type the abnormal data belongs to. The fluctuation types of data anomalies fall mainly into three types: one-time fluctuation, periodic fluctuation, and persistent fluctuation.
A one-time fluctuation occurs only at a certain point in time. The causes behind a one-time rise or fall are generally short-term or sudden events, such as a system update causing data statistics errors or a channel being suddenly taken offline or frozen. A periodic fluctuation rises or falls periodically, for example under seasonal factors such as twenty-one, weekends, and the Spring Festival. Service development is generally periodic; an attendance tool APP, for example, cycles weekly, and its weekdays and weekends fluctuate in clearly different ways. A persistent fluctuation shows a continuous rising or falling trend from a certain time onward. The causes behind a persistent rise or fall are usually deeper, such as a shift in user demand, a long-term suspension of channel delivery, or changes in the overall environment. Different abnormal fluctuations are handled differently. Taking a fall in a data index as an example: a periodic fall generally needs no special handling; for a sudden one-time fall, attention should be paid to whether the triggering event persists; a persistent fall is the most serious, and the longer it lasts, the bigger the problem. In addition, it is not enough to look only at the direction of the trend graph in the daily, weekly, and monthly reports: the larger the amplitude, the more significant the anomaly. The data index here is an abnormal feature value and may also be understood as an anomaly factor.
Specifically, when a concrete problem is encountered, it is analyzed step by step. First, the accuracy of the data and of its statistical source is confirmed. In practice, many index anomalies are caused by problems in the data source, for example server exceptions, errors in back-end data statistics, and abnormal values in data reports, so the authenticity of the data must be confirmed with the product and development staff responsible for data statistics before the analysis starts. Second, the specific service situation and the abnormal behavior of the data index are understood. Third, the data index is decomposed: the calculation method of the index is clarified, the index is broken down level by level, the levels at which the index is abnormal are located, the cause and scope are roughly determined, and the hypotheses for the next step are verified against the actual service. Fourth, once the abnormal scope has been located, further hypotheses are made according to the service and the specific situation is analyzed, for which both internal and external event factors can be considered. Fifth, an optimization strategy is formulated to address the identified cause. Finally, the future impact is predicted, including whether the index will continue to fall, how long the effect will last, and what should be done to avoid a fall; the conclusions are communicated to the operation and product teams, and the execution of the follow-up plan is discussed.
The advantage of this method is that performing anomaly detection on the data to be detected through the established anomaly detection model reduces the interference of redundant attributes to a minimum and allows abnormal points to be located quickly. The rise or fall of a service index and its causes can be found in time, the possible future trend of the product service line can be reflected, and costs can be controlled according to the index data.
In the embodiment of the invention, the historical data is subjected to anomaly analysis, the anomaly factor is extracted to be used as a training corpus, and an anomaly detection model is constructed to be used for carrying out anomaly detection on the data to be detected. The embodiment of the invention realizes the abnormal detection of the data, and if the data is abnormal data, the specific abnormal reason can be determined according to the established abnormal detection model, so that the efficiency and the precision of the abnormal detection are improved, and the efficiency of the subsequent abnormal data analysis is also improved.
Referring to fig. 2, a second embodiment of the data anomaly detection method according to the embodiment of the present invention includes:
201, identifying the data attribute corresponding to each data in the historical data based on preset data attribute categories;
The data attribute categories are obtained by dividing the attribute categories of data in advance. Generally speaking, the attribute categories of data mainly include the nominal attribute, binary attribute, ordinal attribute, numerical attribute, discrete attribute, and continuous attribute.
The value of a nominal attribute is a symbol or the name of an object. Each value represents a category, code, or state, so a nominal attribute is also regarded as a categorical attribute. The values have no meaningful order and are not quantitative. A binary attribute is a nominal attribute with only two categories or states, 0 or 1, where 0 usually means that the attribute does not occur and 1 means that it occurs. If 0 and 1 correspond to false and true, the binary attribute is a Boolean attribute. The possible values of an ordinal attribute have a meaningful order or ranking, but the difference between successive values is unknown. For example, student scores can be divided into four grades: excellent, good, medium, and poor; the beverage cups of a fast food restaurant come in three sizes: large, medium, and small. However, how much larger "large" is than "medium" is unknown. A numerical attribute is a measurable quantity expressed as an integer or real value, and may be interval-scaled or ratio-scaled. An interval-scaled attribute is measured on a scale of equal-sized units. Its values are ordered, so in addition to ranking, the differences between values can be compared and quantified. A ratio-scaled attribute is measured on a ratio scale, so one value can be described as a multiple of another, and the differences between values can also be calculated. A discrete attribute has a finite or countably infinite number of values, for example the student score attribute (excellent, good, medium, poor), a binary attribute (1 and 0), or an age attribute (0 to 110). If the set of possible values of an attribute is infinite but can be put in one-to-one correspondence with the natural numbers, the attribute is also discrete.
An attribute is continuous if it is not discrete.
Each data in the historical data carries the characteristic information of the corresponding data attribute category, the characteristic information of the data attribute category of each historical data is extracted, and the data attribute corresponding to the data is judged according to the characteristic information and the preset data attribute category, so that the data attribute corresponding to each data in the historical data can be identified and obtained.
202, judging whether the data attribute corresponding to each data in the historical data is a numerical value attribute;
when the data attribute corresponding to each data in the historical data is identified and obtained, the historical data belonging to the numerical value attribute is screened out from all the historical data, namely whether the data attribute corresponding to each data in the historical data is the numerical value attribute is judged. Specifically, field division is performed on data attributes corresponding to each piece of data in the historical data, semantic recognition is performed on the divided fields according to a preset semantic recognition tool, the recognized semantics and the semantics of the numerical attributes are compared in a semantic space, and if the semantics are consistent, the data attributes of the corresponding pieces of data are numerical attributes. It should be noted that semantic recognition of fields according to a semantic recognition tool belongs to the prior art, and is not described herein again.
203, if the data attribute corresponding to each data is a numerical value attribute, removing the data belonging to the numerical value attribute;
204, collecting the history data subjected to the elimination processing to form sample data;
and when the data attribute corresponding to the data is identified to be the numerical value attribute, removing the corresponding data from the historical data, namely screening and removing the data belonging to the numerical value attribute from the historical data. And collecting the history data subjected to the elimination processing into a data set as sample data.
And when the data attribute corresponding to each data is not the numerical value attribute, not removing each data, and taking the data which do not belong to the numerical value attribute as sample data.
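Steps 203 and 204 amount to a filter over the historical data. The following is a minimal Python sketch, assuming each record already carries an attribute label produced in step 201; the record layout, the `attribute` key, and the label strings are hypothetical and are not specified by the method.

```python
# Hypothetical record layout: each record is a dict holding the attribute label
# assigned in step 201 and the raw value.
NUMERICAL = "numerical"

def build_sample_data(historical_data):
    """Remove records whose data attribute is numerical; keep the rest as sample data."""
    return [record for record in historical_data if record["attribute"] != NUMERICAL]

historical_data = [
    {"attribute": "nominal", "value": "login_failed"},
    {"attribute": "numerical", "value": 37.5},
    {"attribute": "binary", "value": 1},
]
sample_data = build_sample_data(historical_data)
```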
205, extracting an abnormal type identifier carried by each data in the sample data, and determining a data abnormal type corresponding to each data in the sample data according to the abnormal type identifier;
in this embodiment, it may be defined that when the mobile terminal uploads the abnormal data, the abnormal type of the abnormal data is indicated. For example, when the abnormal data is uploaded by the mobile terminal, an identifier of which preset abnormal type the abnormal data belongs to may be directly carried in the abnormal data, and when the abnormal data analysis platform receives the abnormal data, the identifier may be directly used to distinguish and confirm which preset abnormal type the abnormal data belongs to. Specifically, the abnormal data obtained is analyzed for the abnormal type through an analysis program carried by the abnormal data analysis platform, and the obtained analysis result is the preset abnormal type corresponding to the abnormal data. Wherein the analysis program may be, but is not limited to, a big-data-stream type calculation typing program.
206, determining a data analysis rule based on the data exception type, performing exception analysis on the sample data by using the data analysis rule, and extracting a corresponding first exception factor;
when the data has the abnormal factor, the abnormal condition of the corresponding preset data abnormal type can be caused to occur to the data. And the extraction of the abnormal factors is to analyze the relationship between each sample data and the corresponding data abnormal type according to the determined data abnormal type corresponding to each sample data, and extract the abnormal factors causing the sample data to generate the abnormal corresponding to each data abnormal type from the sample data. One data exception type corresponds to at least one exception factor, namely the data exception effects possibly caused by a plurality of different exception factors are consistent. Specifically, by analyzing the relationship between each sample data and the data exception type corresponding to the sample data, the exception factor determining that each sample data is abnormal can be regarded as a data analysis rule, and the exception factor corresponding to the sample data is determined by performing exception analysis on the sample data according to the data analysis rule. Further, the process of performing anomaly analysis on the sample data and extracting the corresponding first anomaly factor may be to perform feature extraction on each data in the sample data, that is, to extract an anomaly feature value capable of reflecting the data anomaly type from the sample data, and then calculate the anomaly factor corresponding to each data anomaly type by using the anomaly feature value, as the first anomaly factor.
207, extracting all factor features which are associated with the data exception type from each first exception factor;
208, calculating a linear correlation value between the data anomaly type and each factor characteristic;
Each anomaly factor carries a factor feature that reflects its association with the data anomaly type; an anomaly factor is associated with the data anomaly type through the factor feature it carries, and one anomaly factor carries one factor feature. All factor features associated with the data anomaly type are extracted from the first anomaly factors, and the linear correlation value between the data anomaly type and each factor feature is calculated. Specifically, the data anomaly type is taken as the dependent variable and the factor feature of a first anomaly factor as the independent variable, and the Pearson correlation coefficient of the two variables is calculated. The Pearson correlation coefficient of two variables is defined as their covariance divided by the product of their standard deviations. Since one data anomaly type corresponds to at least one first anomaly factor, at least one linear correlation value is calculated for each data anomaly type. It should be noted that the calculation of the Pearson correlation coefficient between two variables belongs to the prior art and is not described here again.
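The Pearson calculation described above can be sketched in plain Python. This is an illustrative sketch only: the function name, the encoding of the data anomaly type as 0/1, and the sample values are assumptions, and returning 0.0 for a constant variable is a convention chosen here rather than part of the method.

```python
import math

def pearson(xs, ys):
    """Pearson coefficient: covariance divided by the product of the standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of covariance and variances cancel, so plain sums are used.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant variable is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one factor feature per sample, and whether the sample
# belongs to a given data anomaly type (1) or not (0).
factor_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
has_anomaly_type = [0, 0, 1, 1, 1]
linear_correlation = pearson(factor_feature, has_anomaly_type)
```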
209, comparing the linear correlation value with a preset correlation threshold;
After the linear correlation value between each data anomaly type and the first anomaly factor is calculated, the linear correlation value is compared with a preset correlation threshold to judge whether it is smaller than the threshold. A linear correlation value smaller than the preset correlation threshold indicates that the correlation between the data anomaly type and the corresponding first anomaly factor is weak. The correlation threshold may be set according to the actual situation and is not limited in this embodiment.
In practical applications, two variables are generally considered strongly correlated when the absolute value of the linear correlation value is above 0.8, weakly correlated when it is between 0.3 and 0.8, and uncorrelated when it is 0.3 or less.
In practice, the correlation coefficient is generally calculated from sample data and is therefore subject to randomness, which grows as the sample size shrinks. With a small sample, the sample correlation coefficient is an unreliable estimate of the population correlation coefficient; that is, it cannot by itself show whether the two populations from which the samples come have a significant linear relationship. Statistical inference is therefore needed, and whether the variables are correlated is determined by a hypothesis test.
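The text does not name a specific test. One common choice, shown here purely as an assumption, is the t-test for a Pearson coefficient: under the null hypothesis of no linear relationship, t = r * sqrt(n - 2) / sqrt(1 - r^2) follows a t-distribution with n - 2 degrees of freedom. The sketch below only computes the statistic; the critical value for the chosen significance level would be looked up separately.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a sample Pearson coefficient r differs from zero.

    r: sample correlation coefficient, n: sample size (hypothetical helper name).
    """
    if n < 3:
        raise ValueError("the test needs at least three samples")
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)  # perfect correlation: statistic is unbounded
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# The same coefficient is far less convincing on a small sample.
t_small = correlation_t_statistic(0.5, 5)
t_large = correlation_t_statistic(0.5, 27)
```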
210, if the linear correlation value is smaller than a preset correlation threshold value, removing the corresponding first abnormal factor from the at least two first abnormal factors to obtain a second abnormal factor;
When the linear correlation value between the data anomaly type and a corresponding first anomaly factor is smaller than the preset correlation threshold, the two do not have a strong correlation, and that first anomaly factor is removed from all the first anomaly factors. The first anomaly factors remaining after the removal are used as second anomaly factors.
When the linear correlation value between the data exception type and the corresponding first exception factor is not less than the preset correlation threshold value, it is indicated that the data exception type and the first exception factor have strong correlation, so that the first exception factor is used as an element for influencing data exception, and the first exception factors having strong correlation with the data exception type are used as second exception factors.
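The screening in steps 209 and 210 can be sketched as a threshold filter. The factor names and the threshold of 0.8 are hypothetical, and comparing the absolute value (so that strong negative correlations are kept) is an assumption based on the rule of thumb given above.

```python
def screen_first_factors(correlations, threshold):
    """Keep the first anomaly factors whose linear correlation value reaches the threshold.

    correlations: dict mapping a factor name to its linear correlation value.
    Returns the names of the second anomaly factors in their original order.
    """
    return [name for name, r in correlations.items() if abs(r) >= threshold]

# Hypothetical first anomaly factors and their correlation values.
first_factors = {"server_error_rate": 0.91, "channel_paused": -0.85, "page_color": 0.12}
second_factors = screen_first_factors(first_factors, threshold=0.8)
```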
211, dividing each data in the sample data according to a preset division rule to obtain a verification data set and a training data set;
Each data in the sample data is divided into a training data set and a verification data set according to a preset division rule. The training data set serves as the training corpus for training a preset detection tool, and the verification data set serves as verification data for verifying the detection results of the preset detection tool, so that the prediction accuracy of the detection tool is continuously improved. In this embodiment, the preset division rule may be defined as randomly dividing the sample data in a 2:8 ratio, that is, randomly extracting 20% of the sample data as verification data and using the remaining 80% as training data.
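A random 2:8 division of this kind might look as follows. The function name and the `seed` parameter are illustrative assumptions; the seed is only there to make the split reproducible.

```python
import random

def split_sample_data(sample_data, verification_ratio=0.2, seed=None):
    """Randomly split sample data into (training data set, verification data set)."""
    rng = random.Random(seed)
    shuffled = list(sample_data)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_verification = int(round(len(shuffled) * verification_ratio))
    return shuffled[n_verification:], shuffled[:n_verification]

training_set, verification_set = split_sample_data(range(100), seed=0)
```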
212, training a preset detection tool by using the second abnormal factor and the training data set as training corpora to obtain a preliminary model;
the second abnormal factor and the training data set are used as training corpora to train a preset detection tool, the second abnormal factor and the training data set are input into the preset detection tool in the process, detection parameters in the detection tool are continuously adjusted, and therefore the detection tool can conduct abnormal detection on data, and a preliminary model is obtained.
213, adjusting the parameters of the preliminary model based on the verification data set to obtain an abnormal detection model;
The detection results of the preliminary model are verified with the verification data set. Each data in the verification data set is input into the preliminary model for anomaly detection, and the preliminary model outputs the corresponding model detection result. The model detection result is compared with the original anomaly detection result of each data in the verification data set. If they differ, the verification data set is used as training corpus and the detection parameters of the preliminary model are adjusted continuously until a higher detection accuracy is obtained, thereby obtaining the anomaly detection model.
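The comparison between model detection results and original anomaly detection results can be sketched as an accuracy measurement. The callable `predict` interface and the toy model below are hypothetical stand-ins for the preliminary model, which the text does not specify.

```python
def verification_accuracy(predict, verification_set):
    """Share of verification items whose model result matches the original result.

    predict: callable mapping a data item to a detection result (hypothetical interface).
    verification_set: list of (data, original_result) pairs.
    """
    if not verification_set:
        raise ValueError("verification set is empty")
    matches = sum(1 for data, original in verification_set if predict(data) == original)
    return matches / len(verification_set)

def toy_preliminary_model(value):
    # Stand-in for the preliminary model: flags values above 10 as abnormal.
    return "abnormal" if value > 10 else "normal"

verification_set = [(3, "normal"), (12, "abnormal"), (9, "abnormal"), (20, "abnormal")]
accuracy = verification_accuracy(toy_preliminary_model, verification_set)
```

A low accuracy would be the signal to keep adjusting the detection parameters; the acceptable level is left open by the text.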
214, calling the anomaly detection model to perform anomaly detection on the data to be detected, and judging whether the data to be detected is anomalous data based on the result of the anomaly detection.
The generated anomaly detection model is called to perform anomaly detection on the data to be detected. In this process, the data to be detected is input into the anomaly detection model, which performs anomaly detection according to preset detection parameters, mainly by judging whether an anomaly factor exists in the data to be detected. If an anomaly factor exists, the data to be detected is anomalous data, and the data anomaly type is determined from the anomaly factor. The anomaly detection model then outputs a detection result. The detection result includes not only the judgment of whether the data to be detected is anomalous data but also, when it is, the corresponding data anomaly type, which indicates the cause of the data anomaly.
In the embodiment of the invention, data attributes corresponding to historical data are identified, the historical data are screened according to the data attributes corresponding to the data to obtain sample data, then abnormal factors with strong linear correlation with data abnormal types are screened from the sample data, and abnormal detection models are trained to perform abnormal detection on the data by taking the abnormal factors and the sample data as training corpora. In the embodiment, the historical data is screened to obtain the sample data, and then the abnormal factor is screened to construct the abnormal detection model, so that the precision of the model is improved, and the accuracy of data abnormal detection is improved.
Referring to fig. 3, a third embodiment of the data anomaly detection method according to the embodiment of the present invention includes:
301, obtaining historical data in each application program, and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data;
302, determining a data analysis rule based on the data exception type, performing exception analysis on sample data by using the data analysis rule, and extracting at least two first exception factors;
303, calculating linear correlation values of the data exception types and the first exception factors, and screening the first exception factors based on the linear correlation values to obtain second exception factors;
304, dividing each data in the sample data according to a preset division rule to obtain a verification data set and a training data set;
Each data in the sample data is divided into a training data set and a verification data set according to a preset division rule. The training data set serves as the training corpus for training a preset detection tool, and the verification data set serves as verification data for verifying the detection results of the preset detection tool, so that the prediction accuracy of the detection tool is continuously improved. In this embodiment, the preset division rule may be defined as randomly dividing the sample data in a 2:8 ratio, that is, randomly extracting 20% of the sample data as verification data and using the remaining 80% as training data.
305, classifying the training data set based on a preset self-help algorithm to obtain a classification result;
The training data set is classified based on a preset self-help algorithm: a number of data are randomly extracted from the training data set with replacement to serve as a training subset. Repeating the random extraction several times forms a plurality of training subsets, from which a plurality of classification trees are constructed to form a random forest. The random forest is then classified by a preset random forest classifier, so that the training data set, that is, each anomaly factor in the training data set, is classified to obtain the classification result.
306, sorting the importance of the second abnormal factor based on the classification result to obtain an abnormal factor sequence;
and sorting the importance of the second abnormal factors according to the obtained classification result, namely sorting the importance of the second abnormal factors according to the classification result of the random forest classifier for classifying the random forest, wherein the classification result output by the random forest classifier is the importance ranking of each abnormal factor, and sorting the importance of the second abnormal factors according to the importance ranking to obtain an abnormal factor sequence.
307, screening the second abnormal factor according to the abnormal factor sequence to obtain a third abnormal factor;
and screening the second abnormal factors according to the obtained abnormal factor sequence, wherein the abnormal factors with higher importance are screened from the second abnormal factors according to the abnormal factor sequence, and the abnormal factors with higher importance are used as third abnormal factors. The importance is mainly determined according to a preset importance threshold, the importance ranking of each second abnormal factor is compared with the preset importance threshold, and when the importance ranking of the second abnormal factor is larger than the preset importance threshold, the second abnormal factor is used as the abnormal factor with higher importance.
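Steps 306 and 307 can be sketched as a sort followed by a threshold filter. The importance scores, factor names, and threshold value below are hypothetical; how the scores are derived from the random forest is not detailed in the text.

```python
def rank_and_screen(importances, importance_threshold):
    """Order second anomaly factors by importance and keep those above the threshold.

    importances: dict mapping a factor name to its importance score.
    Returns (anomaly factor sequence, third anomaly factors).
    """
    ordered = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    factor_sequence = [name for name, _ in ordered]
    third_factors = [name for name, score in ordered if score > importance_threshold]
    return factor_sequence, third_factors

# Hypothetical importance scores for three second anomaly factors.
scores = {"channel_paused": 0.25, "server_error_rate": 0.60, "stats_bug": 0.15}
factor_sequence, third_factors = rank_and_screen(scores, importance_threshold=0.2)
```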
308, taking the third anomaly factor and the training data set as training corpora, and training a preset detection tool to obtain a preliminary model;
The third anomaly factor and the training data set are used as training corpora to train a preset detection tool. The training data set is input into the detection tool, which performs anomaly detection on the training data set and identifies the anomaly factors in it. Each identified anomaly factor is compared with the third anomaly factor; when the identified anomaly factor is a third anomaly factor, the detection result is correct. The detection accuracy of the detection tool is continuously improved in this way, and a preliminary model is obtained.
309, adjusting parameters of the preliminary model based on the verification data set to obtain a data anomaly detection model;
The detection results of the preliminary model are verified with the verification data set. Each data in the verification data set is input into the preliminary model for anomaly detection, and the preliminary model outputs the corresponding model detection result. The model detection result is compared with the original anomaly detection result of each data in the verification data set. If they differ, the verification data set is used as training corpus and the detection parameters of the preliminary model are adjusted continuously until a higher detection accuracy is obtained, thereby obtaining the data anomaly detection model.
310, calling the anomaly detection model to perform anomaly detection on the data to be detected, and judging whether the data to be detected is anomalous data based on the result of the anomaly detection.
The generated anomaly detection model is called to perform anomaly detection on the data to be detected. In this process, the data to be detected is input into the anomaly detection model, which performs anomaly detection according to preset detection parameters, mainly by judging whether an anomaly factor exists in the data to be detected. If an anomaly factor exists, the data to be detected is anomalous data, and the data anomaly type is determined from the anomaly factor. The anomaly detection model then outputs a detection result. The detection result includes not only the judgment of whether the data to be detected is anomalous data but also, when it is, the corresponding data anomaly type, which indicates the cause of the data anomaly.
In the embodiment of the present invention, the steps 301-303 are the same as the steps 101-103 in the first embodiment of the data abnormality detection method, and are not described herein again.
In the embodiment of the invention, the training data set is classified through a preset self-help algorithm, and the importance screening is carried out on the abnormal factors, so that an abnormal detection model with higher precision is trained. The anomaly detection model constructed by the embodiment can improve the accuracy of anomaly detection, and meanwhile, the subsequent data anomaly analysis is carried out by using the anomaly detection result, so that the efficiency of data anomaly analysis is improved.
Referring to fig. 4, a fourth embodiment of the data anomaly detection method according to the embodiment of the present invention includes:
401, obtaining historical data in each application program, and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data;
402, determining a data analysis rule based on the data exception type, performing exception analysis on sample data by using the data analysis rule, and extracting at least two first exception factors;
403, calculating linear correlation values of the data exception type and each first exception factor, and screening each first exception factor based on the linear correlation values to obtain a second exception factor;
404, dividing each data in the sample data according to a preset division rule to obtain a verification data set and a training data set;
Each data in the sample data is divided into a training data set and a verification data set according to a preset division rule. The training data set serves as the training corpus for training a preset detection tool, and the verification data set serves as verification data for verifying the detection results of the preset detection tool, so that the prediction accuracy of the detection tool is continuously improved. In this embodiment, the preset division rule may be defined as randomly dividing the sample data in a 2:8 ratio, that is, randomly extracting 20% of the sample data as verification data and using the remaining 80% as training data.
405, sampling the training data set with replacement based on a preset self-help algorithm to obtain at least one sample, and performing sample expansion processing on the at least one sample to obtain a plurality of self-help sample sets;
406, constructing a plurality of classification trees according to the plurality of self-help sample sets;
Based on a preset self-help algorithm (Bootstrap sampling method), a preset number of samples are extracted from the training data set by random sampling with replacement, and sample expansion is performed to generate a plurality of training subsets. The training subsets are the self-help sample sets. A plurality of classification trees are constructed from the plurality of self-help sample sets, one classification tree per self-help sample set.
When a training subset is generated, the Bootstrap sampling method repeatedly draws a certain number of samples (generally the same number as in the original sample) from the original sample. Each generated training subset therefore differs from the training data set, which introduces randomness.
Specifically, based on the preset self-help algorithm, the training data set is randomly sampled with replacement k times to obtain k new self-help sample sets (training subsets), from which k classification trees are constructed. The samples not drawn in each round form k sets of out-of-bag data.
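The bootstrap draw with its out-of-bag data can be sketched as follows. The function name and the `seed` parameter are illustrative assumptions, and each self-help sample set is drawn at the same size as the training data set, as the text says is generally the case.

```python
import random

def bootstrap_sample_sets(training_data, k, seed=None):
    """Draw k bootstrap sample sets with replacement and collect the out-of-bag data.

    Returns a list of k (sample_set, out_of_bag) pairs.
    """
    rng = random.Random(seed)
    n = len(training_data)
    result = []
    for _ in range(k):
        # Draw n indices with replacement; some items repeat, others are never drawn.
        indices = [rng.randrange(n) for _ in range(n)]
        drawn = set(indices)
        sample_set = [training_data[i] for i in indices]
        out_of_bag = [training_data[i] for i in range(n) if i not in drawn]
        result.append((sample_set, out_of_bag))
    return result

sample_sets = bootstrap_sample_sets(list(range(50)), k=3, seed=1)
```

Each of the k sample sets would then be used to grow one classification tree.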
407, collecting all the classification trees into a random forest, and calling a preset random forest classifier to classify the random forest to obtain a classification result;
All the classification trees are collected into a random forest, and a preset random forest classifier is then called to classify the random forest and obtain a classification result. Specifically, if there are mall variables in total, mtry variables (mtry << mall) are randomly extracted at each node of each classification tree, and the variable with the strongest classification capability among the mtry variables is selected by a node division method. The threshold for splitting on a variable is determined by examining each candidate classification point. Each classification tree is grown to its maximum extent without pruning, and the generated classification trees form the random forest. The preset random forest classifier is used to discriminate and classify new data, and the classification result is determined by the votes of the classification trees. Here, the classification result is the importance ranking of each abnormal factor in the training data set. Classifying with a random forest classifier is prior art and is not described further here.
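The node division described above (random extraction of mtry variables, then examination of each candidate classification point) can be sketched as follows. The embodiment does not name a splitting criterion, so the use of Gini impurity here is an assumption, as are the function names and the toy data.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, mtry, rng):
    """Pick mtry random variables; return (impurity, variable, threshold)."""
    candidates = rng.sample(range(len(rows[0])), mtry)
    best = None
    for f in candidates:
        # Examine every observed value of the variable as a split point.
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            impurity = (len(left) * gini(left)
                        + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best[0]:
                best = (impurity, f, threshold)
    return best

def majority_vote(tree_predictions):
    """Final class is the one predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical data: two variables per record, two classes.
rows = [[1.0, 5.0], [2.0, 3.0], [8.0, 4.0], [9.0, 6.0]]
labels = ["normal", "normal", "abnormal", "abnormal"]
split = best_split(rows, labels, mtry=2, rng=random.Random(0))
vote = majority_vote(["abnormal", "normal", "abnormal"])
```

In this toy data only variable 0 separates the two classes cleanly, so it is chosen as the split variable.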
408, sorting the importance of the second abnormal factor based on the classification result to obtain an abnormal factor sequence;
The second abnormal factors are sorted by importance according to the obtained classification result, that is, according to the result of the random forest classifier classifying the random forest. The classification result output by the random forest classifier is the importance ranking of each abnormal factor, and sorting the second abnormal factors according to this ranking yields the abnormal factor sequence.
409, screening the second abnormal factor according to the abnormal factor sequence to obtain a third abnormal factor;
The second abnormal factors are screened according to the obtained abnormal factor sequence: the abnormal factors of higher importance are selected from the second abnormal factors and used as the third abnormal factors. Importance is judged against a preset importance threshold. The importance ranking of each second abnormal factor is compared with the preset importance threshold, and a second abnormal factor whose importance ranking is larger than the threshold is treated as an abnormal factor of higher importance.
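The sorting and threshold screening of steps 408 and 409 can be sketched as follows. The sketch assumes that importance is available as a numeric score per factor; the factor names, scores, and threshold value are hypothetical.

```python
def screen_factors(importances, threshold):
    """Return (abnormal factor sequence, third abnormal factors)."""
    # Sort the second abnormal factors from most to least important.
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    sequence = [name for name, _ in ranked]
    # Keep only factors whose importance is larger than the threshold.
    third = [name for name, score in ranked if score > threshold]
    return sequence, third

# Hypothetical importance scores for three second abnormal factors.
scores = {"page_views": 0.08, "login_failures": 0.42, "request_latency": 0.35}
sequence, third_factors = screen_factors(scores, threshold=0.1)
```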
410, training a preset detection tool by taking the third anomaly factor and the training data set as training corpora to obtain a preliminary model;
The third abnormal factor and the training data set are used as training corpora to train the preset detection tool. The training data set is input into the detection tool, which performs abnormality detection on the training data set and identifies the abnormal factors it contains. Each identified abnormal factor is compared with the third abnormal factor; when the identified abnormal factor is a third abnormal factor, the detection result is correct. In this way the detection precision of the detection tool is continuously improved and a preliminary model is obtained.
411, inputting the verification data set into the preliminary model, and outputting a detection result;
Each item of the verification data set is input into the preliminary model, which performs abnormality detection on the verification data set, analyzes each item, and outputs a detection result. This detection result is the model detection result.
412, evaluating the detection result based on the data abnormal type in the verification data set, and judging whether the evaluation result meets a preset standard;
Before each item of the verification data set is input into the preliminary model, the actual abnormal detection result and the corresponding data abnormal type of each item are determined. The model detection result output by the preliminary model is then compared with the actual abnormal detection result of the verification data set, that is, the model is verified with the verification data. The model detection result is evaluated, and whether the evaluation result meets the preset standard is judged.
413, if the evaluation result does not meet the preset standard, adjusting the parameters of the preliminary model according to a preset parameter adjusting rule to obtain a data anomaly detection model;
When the evaluation result does not meet the preset standard, that is, when the model detection result differs from the actual abnormal detection result, the parameters of the preliminary model are adjusted repeatedly until the optimal parameters and the best fitting effect are obtained, which yields the data anomaly detection model. The preset standard is that the detection result is the same as the actual abnormal detection result. When the evaluation result meets the preset standard, the precision of the parameters of the preliminary model has reached the standard value, that is, the preliminary model is already a high-precision model and can be used as the data anomaly detection model.
414, calling the anomaly detection model to perform anomaly detection on the data to be detected, and judging whether the data to be detected is anomalous data based on the anomaly detection result.
The generated anomaly detection model is called to perform anomaly detection on the data to be detected. The data to be detected is input into the anomaly detection model, which performs anomaly detection according to preset detection parameters, mainly judging whether an anomaly factor exists in the data. If an anomaly factor exists, the data to be detected is anomalous data, and the data anomaly type is determined from the anomaly factor. The anomaly detection model then outputs a detection result. The detection result states whether the data to be detected is anomalous data and, when it is, also includes the corresponding data anomaly type, which indicates the reason for the data anomaly.
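The evaluate-then-adjust loop of steps 411 to 413 can be sketched as follows. The embodiment does not specify the parameter adjusting rule, so trying a list of candidate parameters in turn is an assumption, as are the function names and the single-threshold detector used as a stand-in for the preliminary model.

```python
def accuracy(predict, validation_set):
    """Share of items whose model result equals the actual result."""
    correct = sum(1 for x, actual in validation_set if predict(x) == actual)
    return correct / len(validation_set)

def tune(build_model, candidate_params, validation_set, required=1.0):
    """Try candidate parameters until the preset standard is met."""
    best_params, best_score = None, -1.0
    for params in candidate_params:
        score = accuracy(build_model(params), validation_set)
        if score > best_score:
            best_params, best_score = params, score
        if score >= required:
            break  # evaluation result meets the preset standard
    return best_params, best_score

def threshold_detector(limit):
    """Hypothetical stand-in model: values above the limit are abnormal."""
    return lambda value: "abnormal" if value > limit else "normal"

# Hypothetical verification data: (value, actual detection result).
validation = [(1, "normal"), (4, "normal"), (7, "abnormal"), (9, "abnormal")]
best_limit, best_accuracy = tune(threshold_detector, [0, 2, 5, 8], validation)
```

With `required=1.0` the loop stops only when every model detection result equals the actual result, which mirrors the preset standard stated above; a lower value would be a relaxation not described in the embodiment.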
In the embodiment of the present invention, steps 401-403 are the same as steps 101-103 in the first embodiment of the data anomaly detection method and are not described again here.
In the embodiment of the invention, the trained preliminary model is subjected to parameter adjustment through the verification data set, and the precision of the model parameters is continuously improved, so that the accuracy of the constructed abnormal detection model is higher, and the accuracy and the efficiency of data abnormal detection can be improved when the abnormal detection model is called to carry out abnormal detection on data.
The data anomaly detection method in the embodiment of the present invention is described above. The data anomaly detection device in the embodiment of the present invention is described below with reference to fig. 5. One embodiment of the data anomaly detection device includes:
the preprocessing module 501 is configured to obtain historical data in each application program, and preprocess the historical data to obtain sample data and a data exception type corresponding to the sample data;
an extracting module 502, configured to determine a data analysis rule based on the data exception type, perform exception analysis on the sample data by using the data analysis rule, and extract at least two first exception factors;
a calculating module 503, configured to calculate a linear correlation value between the data exception type and each of the first exception factors, and screen each of the first exception factors based on the linear correlation value to obtain a second exception factor;
a training module 504, configured to train a preset detection tool with the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model;
the detecting module 505 is configured to invoke the anomaly detection model to perform anomaly detection on the data to be detected, and determine whether the data to be detected is anomalous data based on a result of the anomaly detection.
According to the embodiment of the invention, the anomaly detection is carried out on the data to be detected by constructing the anomaly detection model, so that the anomaly detection of the data is realized, the accuracy and efficiency of the data anomaly detection are improved, and the subsequent anomaly analysis of the anomalous data is facilitated.
Referring to fig. 6, another embodiment of the data anomaly detection apparatus in the embodiment of the present invention includes:
the preprocessing module 501 is configured to obtain historical data in each application program, and preprocess the historical data to obtain sample data and a data exception type corresponding to the sample data;
an extracting module 502, configured to determine a data analysis rule based on the data exception type, perform exception analysis on the sample data by using the data analysis rule, and extract at least two first exception factors;
a calculating module 503, configured to calculate a linear correlation value between the data exception type and each of the first exception factors, and screen each of the first exception factors based on the linear correlation value to obtain a second exception factor;
a training module 504, configured to train a preset detection tool with the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model;
the detecting module 505 is configured to invoke the anomaly detection model to perform anomaly detection on data to be detected, and determine whether the data to be detected is anomalous data based on a result of the anomaly detection.
Optionally, the preprocessing module 501 is specifically configured to:
acquiring historical data in each application program, and identifying data attributes corresponding to each data in the historical data based on preset data attribute categories;
judging whether the data attribute corresponding to each data in the historical data is a numerical value attribute;
if the data attribute corresponding to each data is a numerical value attribute, removing the data belonging to the numerical value attribute;
collecting the historical data subjected to the elimination processing to form sample data;
and extracting an abnormal type identifier carried by each data in the sample data, and determining a data abnormal type corresponding to each data in the sample data according to the abnormal type identifier.
Optionally, the calculating module 503 is specifically configured to:
extracting all factor features which are associated with the data exception type in each first exception factor;
calculating a linear correlation value between the data anomaly type and each factor feature;
comparing the linear correlation value with a preset correlation threshold value;
and if the linear correlation value is smaller than a preset correlation threshold value, removing the corresponding first abnormal factor from at least two first abnormal factors to obtain a second abnormal factor.
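The correlation-based screening performed by the calculating module can be sketched as follows. The sketch assumes that the linear correlation value is the Pearson correlation coefficient, that the data exception type is encoded numerically (0 or 1), and that the absolute value of the coefficient is compared with the threshold; none of these choices is stated in this passage, and the factor names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no linear relationship
    return cov / math.sqrt(var_x * var_y)

def screen_by_correlation(factors, exception_type, threshold):
    """Remove first exception factors whose correlation is below the threshold."""
    return [name for name, values in factors.items()
            if abs(pearson(values, exception_type)) >= threshold]

# Hypothetical data: exception type encoded as 0/1 for four records.
exception_type = [0, 0, 1, 1]
first_factors = {"error_count": [1, 2, 8, 9], "weekday": [1, 2, 2, 1]}
second_factors = screen_by_correlation(first_factors, exception_type, 0.5)
```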
Optionally, the training module 504 includes:
the dividing unit 5041 is configured to divide each data in the sample data according to a preset dividing rule to form a verification data set and a training data set respectively;
a training unit 5042, configured to train a preset detection tool with the second abnormal factor and the training data set as training corpora to obtain a preliminary model;
an adjusting unit 5043, configured to adjust parameters of the preliminary model based on the verification data set, so as to obtain an anomaly detection model.
Optionally, the training unit 5042 includes:
a classification subunit 50421, configured to classify the training data set based on a preset self-help algorithm, so as to obtain a classification result;
a sorting subunit 50422, configured to perform importance sorting on the second abnormal factor based on the classification result, to obtain an abnormal factor sequence;
a screening subunit 50423, configured to screen the second abnormal factor according to the abnormal factor sequence to obtain a third abnormal factor;
and a training subunit 50424, configured to train a preset detection tool with the third anomaly factor and the training data set as training corpora to obtain a preliminary model.
Optionally, the classification subunit 50421 is specifically configured to:
based on a preset self-help algorithm, sampling the training data set with replacement to obtain at least one sample, and performing sample expansion processing on the at least one sample to obtain a plurality of self-help sample sets;
constructing a plurality of classification trees according to the self-help sample sets;
and collecting a plurality of classification trees into a random forest, and calling a preset random forest classifier to classify the random forest to obtain a classification result.
Optionally, the adjusting unit 5043 is specifically configured to:
inputting the verification data set into the preliminary model, and outputting a detection result;
evaluating the detection result based on the data abnormity type in the verification data set, and judging whether the evaluated result meets a preset standard or not;
and if the evaluation result does not meet the preset standard, adjusting the parameters of the preliminary model according to a preset parameter adjusting rule to obtain an abnormal detection model.
In the embodiment of the invention, the accuracy of the model parameters is continuously improved by screening the abnormal factors and adjusting the parameters of the constructed preliminary model by using the verification data set, so that the accuracy of the constructed abnormal detection model is higher, and the efficiency of data abnormal detection is improved.
Referring to fig. 7, an embodiment of the data anomaly detection device in the embodiment of the present invention is described in detail from the perspective of hardware processing.
Fig. 7 is a schematic structural diagram of a data anomaly detection apparatus 700 according to an embodiment of the present invention. The data anomaly detection apparatus 700 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing an application 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the data anomaly detection apparatus 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and to execute, on the data anomaly detection apparatus 700, the series of instruction operations in the storage medium 730.
The data anomaly detection apparatus 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input-output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration shown in FIG. 7 does not limit the data anomaly detection device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the data anomaly detection method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data anomaly detection method is characterized by comprising the following steps:
acquiring historical data in each application program, and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data, wherein the historical data is exception data generated in the product operation management process;
determining a data analysis rule based on the data exception type, performing exception analysis on the sample data by using the data analysis rule, and extracting at least two first exception factors;
calculating linear correlation values of the data exception type and each first exception factor, and screening each first exception factor based on the linear correlation values to obtain a second exception factor;
taking the second abnormal factor and the sample data as training corpora, and training a preset detection tool to obtain an abnormal detection model;
calling the anomaly detection model to perform anomaly detection on data to be detected, and judging whether the data to be detected is anomalous data or not based on the result of the anomaly detection, wherein the data to be detected is product operation data.
2. The method according to claim 1, wherein the obtaining historical data in each application program and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data comprises:
acquiring historical data in each application program, and identifying data attributes corresponding to each data in the historical data based on preset data attribute categories;
judging whether the data attribute corresponding to each data in the historical data is a numerical value attribute;
if so, removing the data belonging to the numerical value attribute;
collecting the historical data subjected to the elimination processing to form sample data;
and extracting an abnormal type identifier carried by each data in the sample data, and determining a data abnormal type corresponding to each data in the sample data according to the abnormal type identifier.
3. The method according to claim 2, wherein the calculating a linear correlation value between the data anomaly type and the first anomaly factor, and screening the first anomaly factor based on the linear correlation value to obtain a second anomaly factor comprises:
extracting all factor features which are associated with the data exception type in each first exception factor;
calculating a linear correlation value between the data anomaly type and each factor feature;
comparing the linear correlation value with a preset correlation threshold value;
and if the linear correlation value is smaller than a preset correlation threshold value, removing the corresponding first abnormal factor from at least two first abnormal factors to obtain a second abnormal factor.
4. The method according to claim 3, wherein the training a preset detection tool with the second anomaly factor and the sample data as training corpora to obtain an anomaly detection model comprises:
dividing each data in the sample data according to a preset division rule to respectively form a verification data set and a training data set;
taking the second abnormal factor and the training data set as training corpora, and training a preset detection tool to obtain a preliminary model;
and adjusting parameters of the preliminary model based on the verification data set to obtain an abnormal detection model.
5. The data anomaly detection method according to claim 4, wherein the training a preset detection tool by using the second anomaly factor and the training data set as training corpora to obtain a preliminary model comprises:
classifying the training data set based on a preset self-help algorithm to obtain a classification result;
based on the classification result, performing importance sorting on the second abnormal factor to obtain an abnormal factor sequence;
screening the second abnormal factor according to the abnormal factor sequence to obtain a third abnormal factor;
and taking the third anomaly factor and the training data set as training corpora, and training a preset detection tool to obtain a preliminary model.
6. The data anomaly detection method according to claim 5, wherein the classifying the training data set based on a preset self-help algorithm to obtain a classification result comprises:
based on a preset self-help algorithm, sampling the training data set with replacement to obtain at least one sample, and performing sample expansion processing on the at least one sample to obtain a plurality of self-help sample sets;
constructing a plurality of classification trees according to the self-help sample sets;
and collecting a plurality of classification trees into a random forest, and calling a preset random forest classifier to classify the random forest to obtain a classification result.
7. The method according to any one of claims 4 to 6, wherein the adjusting parameters of the preliminary model based on the validation dataset to obtain an anomaly detection model comprises:
inputting the verification data set into the preliminary model, and outputting a detection result;
evaluating the detection result based on the data abnormity type in the verification data set, and judging whether the evaluated result meets a preset standard or not;
and if the evaluation result does not meet the preset standard, adjusting the parameters of the preliminary model according to a preset parameter adjusting rule to obtain an abnormal detection model.
8. A data abnormality detection device, characterized in that the data abnormality detection device comprises:
the preprocessing module is used for acquiring historical data in each application program and preprocessing the historical data to obtain sample data and a data exception type corresponding to the sample data;
the extraction module is used for determining a data analysis rule based on the data exception type, carrying out exception analysis on the sample data by using the data analysis rule and extracting at least two first exception factors;
the calculation module is used for calculating linear correlation values of the data exception types and the first exception factors, and screening the first exception factors based on the linear correlation values to obtain second exception factors;
the training module is used for training a preset detection tool by taking the second abnormal factor and the sample data as training corpora to obtain an abnormal detection model;
and the detection module is used for calling the abnormality detection model to perform abnormality detection on the data to be detected and judging whether the data to be detected is abnormal data or not based on the result of the abnormality detection.
9. A data abnormality detection apparatus characterized by comprising:
a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the data anomaly detection apparatus to perform the steps of the data anomaly detection method of any one of claims 1-7.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor implement the steps of the data anomaly detection method according to any one of claims 1-7.
CN202110599503.5A 2021-05-31 2021-05-31 Data anomaly detection method, device, equipment and storage medium Pending CN113283512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110599503.5A CN113283512A (en) 2021-05-31 2021-05-31 Data anomaly detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113283512A true CN113283512A (en) 2021-08-20

Family

ID=77282524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110599503.5A Pending CN113283512A (en) 2021-05-31 2021-05-31 Data anomaly detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113283512A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117610699A (en) * 2023-09-04 2024-02-27 北京中电飞华通信有限公司 Zero-carbon comprehensive energy optimization equipment and method applied to park

Similar Documents

Publication Publication Date Title
CN106951984B (en) Dynamic analysis and prediction method and device for system health degree
US10621493B2 (en) Multiple record linkage algorithm selector
US20070061144A1 (en) Batch statistics process model method and system
CN111177714A (en) Abnormal behavior detection method and device, computer equipment and storage medium
CN107168995B (en) Data processing method and server
EP3340136A1 (en) Systems and methods for determining relationships between defects
EP1958034B1 (en) Use of sequential clustering for instance selection in machine condition monitoring
CN113051291A (en) Work order information processing method, device, equipment and storage medium
CN111984442A (en) Method and device for detecting abnormality of computer cluster system, and storage medium
US20210397956A1 (en) Activity level measurement using deep learning and machine learning
KR20190110084A (en) Esg based enterprise assessment device and operating method thereof
CN115204536A (en) Building equipment fault prediction method, device, equipment and storage medium
CN111242170B (en) Food inspection and detection project prediction method and device
CN114780606B (en) Big data mining method and system
CN113098912B (en) User account abnormity identification method and device, electronic equipment and storage medium
US20130198147A1 (en) Detecting statistical variation from unclassified process log
CN113283512A (en) Data anomaly detection method, device, equipment and storage medium
CN115952426B (en) Distributed noise data clustering method based on random sampling and user classification method
US20230156043A1 (en) System and method of supporting decision-making for security management
CN117170915A (en) Data center equipment fault prediction method and device and computer equipment
CN116842240A (en) Data management and control system based on full-link management and control
CN108763242B (en) Label generation method and device
CN113190426A (en) Stability monitoring method for big data scoring system
CN111199419B (en) Stock abnormal transaction identification method and system
CN114064757A (en) Application program optimization method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination