CN115048464A - User operation behavior data detection method and device and electronic equipment - Google Patents
User operation behavior data detection method and device and electronic equipment
- Publication number
- CN115048464A (application number CN202110251231.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- user
- abnormal
- operation behavior
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention provides a method and a device for detecting user operation behavior data, and electronic equipment, belonging to the technical field of computers. The method comprises the following steps: collecting user operation behavior data; performing entity extraction on the user operation behavior data to obtain entity identification data; performing feature selection and feature dimension reduction on the entity identification data to obtain dimension-reduced feature data; performing cluster analysis on the feature data to obtain classification data of the various operation behaviors; and performing data analysis on the classification data with an anomaly detection algorithm to obtain normal data of the user's normal operation behaviors and abnormal data of the user's abnormal operation behaviors. By applying entity extraction, feature selection, feature dimension reduction, cluster analysis and anomaly detection to the user operation behavior data, the invention can effectively detect abnormal data of abnormal user operation behaviors.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for detecting user operation behavior data and electronic equipment.
Background
In the prior art, anomaly detection systems play an important role in discovering violations in a network. Because it is difficult to extract abnormal traffic directly from massive data, existing anomaly detection equipment randomly samples all traffic data and then analyzes the sampled traffic for anomalies. However, because the traffic generated by normal user behavior in a network far exceeds the abnormal traffic, random sampling can miss a large amount of abnormal traffic. When conventional machine learning, deep learning or random sampling is used for anomaly detection, the following problems arise in practice: parameters are difficult to set, the algorithms rely on too many assumptions, the data content is subject to many restrictions, and so on.
Disclosure of Invention
The invention provides a method and a device for detecting user operation behavior data, and electronic equipment, to solve the problems in the prior art that a large amount of abnormal traffic is missed when anomaly detection is performed on user behavior, and that the related algorithms suffer from difficult parameter setting, too many assumptions and heavily restricted data content. The invention enables real-time monitoring of user operation behaviors and prediction of possible illegal operations.
The invention provides a method for detecting user operation behavior data, which comprises the following steps:
acquiring user operation behavior data, wherein the user operation behavior data is used for analyzing whether the operation behavior of a user is abnormal or not;
performing entity extraction on the user operation behavior data to obtain entity identification data, wherein the entity identification data is used for extracting data related to abnormal operation behaviors of the user;
performing feature selection and feature dimension reduction on the entity identification data to obtain feature data subjected to dimension reduction, wherein the feature data are data for realizing feature extraction and data compression through feature selection and feature dimension reduction;
performing cluster analysis on the feature data to obtain classification data of various operation behaviors, wherein the classification data is used for classifying the various operation behaviors of the user;
and performing data analysis on the classification data by adopting an anomaly detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of abnormal operation behaviors of the user.
According to the detection method for the user operation behavior data, provided by the invention, the collection of the user operation behavior data comprises the following steps:
acquiring user operation behavior data based on a first database, wherein the first database stores relational data and log data for recording various operation behaviors of a user;
the user operation behavior data comprises one or more of the following: the start/end times of the user's various operations, the specific steps of each operation, the operation sequence, and the final result of each operation.
According to the detection method of the user operation behavior data provided by the invention, the entity extraction is carried out on the user operation behavior data to obtain entity identification data, and the method comprises the following steps:
labeling part of the user operation behavior data to be used as training data, and training an entity extraction model by utilizing a neural network;
performing entity extraction on the user operation behavior data based on the entity extraction model to obtain entity identification data; wherein,
the first layer of the entity extraction model is a word embedding layer, which converts an input word sequence into word vectors for output;
the second layer of the entity extraction model is a BiLSTM layer, which takes the word vectors output by the first layer as input and is trained to learn the relation between words and output labels; the BiLSTM layer comprises a forward LSTM network and a backward LSTM network, which are connected through an output layer;
the third layer of the entity extraction model is an attention model applied to the output sequence of the BiLSTM layer, which is used for the labeling problem so that the entity extraction model can better focus on local features and highlight the important role of keywords;
and the fourth layer of the entity extraction model is a CRF layer applied after the attention mechanism, which outputs transition scores between labels through a transition matrix and obtains an optimal label sequence based on the transition rules of each label and the validity of the label grammar.
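The decoding performed by the CRF layer can be illustrated with a minimal Viterbi sketch. This is not the patent's implementation: the tag set `LABELS`, the emission scores (standing in for the output of the BiLSTM and attention layers) and the transition scores are all hypothetical values chosen for illustration, and the rule "I-OP may not follow O" is an assumed label-grammar constraint.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the label indices of the highest-scoring path.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])      # best path score ending in each label
    backptr: List[List[int]] = []   # best previous label for each step

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
        score = new_score
        backptr.append(pointers)

    # Trace back from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backptr):
        best = pointers[best]
        path.append(best)
    return path[::-1]


LABELS = ["O", "B-OP", "I-OP"]  # hypothetical tag set
TRANSITIONS = [
    [0.0, 0.0, -1e4],   # from O: I-OP directly after O is penalized
    [0.0, -1.0, 1.0],   # from B-OP
    [0.0, -1.0, 0.5],   # from I-OP
]
EMISSIONS = [
    [1.0, 0.2, 0.1],    # token 0 locally prefers O
    [0.1, 0.4, 0.9],    # token 1 locally prefers I-OP
    [0.2, 0.1, 0.8],    # token 2 locally prefers I-OP
]
decoded = [LABELS[i] for i in viterbi_decode(EMISSIONS, TRANSITIONS)]
print(decoded)
```

Picking the best label per token independently would give O, I-OP, I-OP, which violates the assumed grammar; decoding over the transition matrix instead selects a sequence that begins with B-OP.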
According to the method for detecting user operation behavior data provided by the invention, the step of performing feature selection and feature dimension reduction on the entity identification data to obtain the feature data after dimension reduction comprises the following steps:
aggregating the entity identification data with data stored in a second database, wherein the second database stores data on the handling of user services;
handling outliers and duplicate values appearing in the data;
performing feature selection on the processed data, and storing the filtered feature selection data;
calculating a covariance matrix representing the correlation of the data based on the feature selection data, and performing eigendecomposition on the covariance matrix to obtain a set of eigenvalues and eigenvectors;
and projecting the feature selection data onto the feature matrix formed from the eigenvalues and eigenvectors to obtain the dimension-reduced feature data, and storing the feature data.
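The covariance, eigendecomposition and projection steps correspond to principal component analysis. The following NumPy sketch shows one way to carry them out; the function name `pca_reduce`, the synthetic data and the choice of two retained components are assumptions made for illustration, not details from the patent.

```python
import numpy as np


def pca_reduce(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                   # projection (feature) matrix
    return Xc @ W


# Synthetic data: the second column is almost a multiple of the first,
# so two components capture nearly all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
Z = pca_reduce(X, n_components=2)
print(Z.shape)
```

The projected columns are ordered by decreasing variance and are mutually uncorrelated, which is what makes the reduced data a compressed representation of the original features.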
According to the detection method of the user operation behavior data provided by the invention, the cluster analysis is carried out on the feature data to obtain the classification data of various operation behaviors, and the method comprises the following steps:
based on a K-means density clustering algorithm, dividing the set of feature data into objects belonging to different clusters according to feature similarity, such that data with similar features are assigned to the same cluster and data with dissimilar features are kept outside that cluster;
performing data analysis based on the density of the feature data distribution to obtain classification data of the various operation behaviors;
wherein, in the K-means density clustering algorithm, a threshold is preset before clustering, a weight is calculated from the density of the feature data, the average intra-cluster distance and the inter-cluster distance, distances between feature data are calculated as weighted Euclidean distances, and initial cluster centers are selected according to the calculated density, weight and distance to obtain the initial input parameters of the K-means density clustering algorithm.
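The text does not give formulas for the density, the weight or the center-selection rule, so the sketch below is a simplified illustration built on explicit assumptions: density is the number of points within the preset threshold `radius`, the per-feature weights `w` of the weighted Euclidean distance are supplied by the caller, and each further center maximizes density multiplied by the distance to the nearest center already chosen. The intra-cluster and inter-cluster terms of the patent's weight are not modeled.

```python
import numpy as np


def weighted_dist(A: np.ndarray, B: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Pairwise weighted Euclidean distance between rows of A and rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((w * diff ** 2).sum(axis=2))


def density_init(X: np.ndarray, k: int, radius: float, w: np.ndarray) -> np.ndarray:
    """Pick k initial centers that are dense and far from each other (assumed rule)."""
    D = weighted_dist(X, X, w)
    density = (D < radius).sum(axis=1)      # neighbours within the preset threshold
    centers = [int(np.argmax(density))]
    while len(centers) < k:
        d_min = D[:, centers].min(axis=1)   # distance to the nearest chosen center
        centers.append(int(np.argmax(density * d_min)))
    return X[centers]


def kmeans_density(X, k, radius, w, n_iter=20):
    """Standard K-means iterations started from density-based centers."""
    C = density_init(X, k, radius, w)
    for _ in range(n_iter):
        labels = weighted_dist(X, C, w).argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    labels = weighted_dist(X, C, w).argmin(axis=1)
    return labels, C


# Two well-separated synthetic groups of feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centers = kmeans_density(X, k=2, radius=1.0, w=np.ones(2))
print(np.bincount(labels))
```

Seeding from dense, mutually distant points avoids the random initialization of plain K-means, which is the motivation the text gives for computing the initial input parameters.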
According to the detection method of the user operation behavior data provided by the invention, the classification data is subjected to data analysis based on an anomaly detection algorithm to obtain normal data of the normal operation behavior of the user and abnormal data of the illegal operation behavior of the user, and the method comprises the following steps:
scoring the classification data for anomalies with three anomaly detection algorithms, namely Isolation Forest, One-Class SVM and Local Outlier Factor (LOF), to obtain the corresponding anomaly score values;
performing weighted normalization on the anomaly score values output by the three anomaly detection algorithms to obtain a ranking of the anomaly score values across all users;
and determining normal data of the normal operation behaviors of the user and abnormal data of the illegal operation behaviors of the user according to the ranking of the anomaly score values.
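The weighted normalization and ranking step can be sketched as follows. The patent does not specify the normalization or the weights, so min-max scaling, the weight values, the user identifiers and the raw scores are all hypothetical; the sketch also assumes each detector's score has been oriented so that a higher value means more anomalous.

```python
from typing import Dict, List, Tuple


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(user_ids: List[str],
                  detector_scores: Dict[str, List[float]],
                  weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Combine per-detector scores into one ranking, most anomalous first."""
    total_w = sum(weights.values())
    normalized = {name: min_max(s) for name, s in detector_scores.items()}
    fused = [
        sum(weights[name] * normalized[name][i] for name in normalized) / total_w
        for i in range(len(user_ids))
    ]
    return sorted(zip(user_ids, fused), key=lambda pair: pair[1], reverse=True)


users = ["u1", "u2", "u3", "u4"]
scores = {  # hypothetical raw scores on different scales
    "isolation_forest": [0.10, 0.20, 0.90, 0.15],
    "one_class_svm": [1.0, 1.5, 9.0, 2.0],
    "lof": [1.0, 1.1, 3.0, 1.2],
}
weights = {"isolation_forest": 0.4, "one_class_svm": 0.3, "lof": 0.3}
ranking = fuse_and_rank(users, scores, weights)
print(ranking[0][0])  # u3 scores highest under every detector
```

Normalizing before weighting matters because the three detectors produce scores on different scales; without it, the detector with the largest raw range would dominate the ranking.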
According to the method for detecting the user operation behavior data provided by the invention, after the classification data is subjected to data analysis based on an anomaly detection algorithm to obtain normal data of the normal operation behavior of the user and abnormal data of the illegal operation behavior of the user, the method further comprises the following steps:
if the data is determined to be abnormal data of illegal operation behaviors of the user, notifying a system administrator and related technicians by e-mail and short message (SMS), and starting a disaster recovery mechanism for part of the abnormal data to resolve the anomaly.
The invention also provides a device for detecting the user operation behavior data, which comprises:
the data acquisition module is used for acquiring user operation behavior data, wherein the user operation behavior data is data describing various operation behaviors of a user;
the entity extraction module is used for performing entity extraction on the user operation behavior data to obtain entity identification data, and the entity identification data is data related to abnormal data extracted from the user operation behavior data;
the feature selection module is used for carrying out feature selection and feature dimension reduction on the entity identification data to obtain dimension-reduced feature data, and the feature data is data for realizing feature extraction and data compression through the feature selection and the feature dimension reduction;
the cluster analysis module is used for carrying out cluster analysis on the feature data to obtain classification data of various operation behaviors, and the classification data is used for classifying the various operation behaviors of the user;
and the anomaly detection module is used for carrying out data analysis on the classification data by adopting an anomaly detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of illegal operation behaviors of the user.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the steps of the method for detecting the user operation behavior data are realized.
The present invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the method for detecting user operation behavior data as described in any one of the above.
According to the method and the device for detecting user operation behavior data and the electronic equipment provided by the invention, abnormal data of abnormal user operation behaviors can be effectively detected by performing entity extraction, feature selection, feature dimension reduction, cluster analysis and anomaly detection on the user operation behavior data.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for detecting user operation behavior data according to the present invention;
FIG. 2 is a schematic flow chart of the entity extraction step provided by the present invention;
FIG. 3 is a schematic diagram of an entity extraction model provided by the present invention;
FIG. 4 is a schematic flow chart of the feature processing steps provided by the present invention;
FIG. 5 is a schematic flow chart of the cluster analysis steps provided by the present invention;
FIG. 6 is a schematic flow chart of an anomaly scoring step provided by the present invention;
FIG. 7 is a schematic structural diagram of a device for detecting user operation behavior data according to the present invention;
FIG. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein.
With the advent of the big data era, the number of internet users has increased sharply, the volume of data in networks is growing massively, and cheating and violations affecting network system security are increasing year by year. By studying and analyzing user operations, users performing illegal operations can be found as early as possible, which safeguards network security and the normal operation of the system. Finding illegal operations in user behavior through anomaly detection is therefore a problem that needs to be solved at present.
Because network protection mechanisms still need to be improved, security monitoring and cheat detection become all the more important. Anomaly detection collects information during user operation and analyzes it to determine whether violations exist. In the prior art, common anomaly detection methods mainly rely on machine learning and deep learning, for example Decision Tree, Random Forest, Support Vector Machine (SVM), AdaBoost, GBDT (Gradient Boosting Decision Tree) and neural networks.
In the prior art, the conventional machine learning, deep learning algorithm or random sampling is used for anomaly detection, and the following problems mainly exist in the actual operation process:
First, parameter setting is difficult.
For traditional anomaly detection algorithms, finding optimal parameters is difficult, particularly for proximity-based methods. These algorithms quantify the degree of abnormality through the concept of outlier strength; their running time and complexity grow with the dimension, parameter search is hard, and a large amount of modeling time is spent determining the model's parameters.
Second, feature engineering may not be accurate enough, and individual algorithms rely on too many assumptions.
To date, many methods have been proposed for analyzing abnormal behavior using user operation logs. However, feature engineering has not been described systematically and in detail, and class-related statistical features cannot be used in a single-layer classification model, which limits detection efficiency. Traditional classification algorithms include logistic regression (LR), the support vector machine (SVM), naive Bayes (NB), K-nearest neighbors (KNN), and the like. A logistic regression model performs poorly when the feature space is large and is prone to overfitting. When there are many observed variables, a support vector machine classifies inefficiently, and a suitable kernel function is hard to find. A naive Bayes model is sensitive to the representation of the input data and requires prior probabilities to be calculated. A K-nearest neighbor model has high time and space complexity, takes a long time to run and is inefficient. In addition, these algorithms cannot achieve low variance and low bias at the same time: naive Bayes, for example, is a high-bias, low-variance classifier, whereas a K-nearest neighbor model is a low-bias, high-variance classifier. As a result, an abnormal behavior detection system based on traditional machine learning algorithms cannot balance the detection rate against the false alarm rate.
Third, manual maintenance is costly.
Whether user operations are compliant sometimes also has to be judged by a professional skilled in the field. With manual operation and maintenance, labor cost is high: the more complex the system, the more manpower must be invested and the higher the cost. Moreover, manual operation and maintenance cannot carry out anomaly monitoring around the clock.
Fourth, the data content is subject to many restrictions.
Traditional anomaly detection algorithms require training data items that are already tabulated numerical data, but the user log files stored in some systems are mostly unstructured text containing a large amount of important information. If this information is not extracted, the results are greatly affected.
Therefore, in view of the problems in the prior art, the present invention provides a method, an apparatus and an electronic device for detecting user operation behavior data. By analyzing the various behavior data of a user's foreground operations and combining data mining technology, the invention monitors user behavior in real time, predicts possible illegal operations, and can effectively detect abnormal data of abnormal user operation behaviors.
The technical terms to which the present invention relates are described below:
(1) Information extraction
Information Extraction (IE) structures the information contained in a text and converts it into a table-like organized form. The input of an information extraction system is original text, and the output is information points in a fixed format. Information points are extracted from various documents and then integrated in a unified form; this is the main task of information extraction. The advantage of integrating information in a uniform form is that it facilitates inspection and comparison. Information extraction techniques do not attempt to fully understand the entire document, but rather analyze the portions of the document that contain relevant information. Which information is relevant is determined by the domain for which the system was designed.
The information extraction task mainly comprises entity extraction, relation extraction and the like. Entity extraction, also called Named Entity Recognition (NER), refers to recognizing named entities with specific meanings in unstructured text and labeling their categories (such as person names, place names, organization names, amounts of money, etc.). Specifically, the task of entity recognition is to identify, in the text to be processed, named entities in three major categories (entity, time and number) and seven minor categories (person name, organization name, place name, time, date, currency and percentage).
Entity recognition generally involves two tasks: recognizing entity word boundaries and recognizing entity word categories. The emphasis differs between Chinese and English. In English, entity information has obvious features, typically a capitalized first letter, so the English NER task is relatively simple and focuses more on recognizing entity word categories. Entity recognition in Chinese is more difficult: besides determining the entity category, the entity boundaries must also be found.
(2) Cluster analysis
Cluster analysis refers to an analytical process that groups a collection of physical or abstract objects into classes that are composed of similar objects. It is an important human behavior.
The goal of cluster analysis is to group data on the basis of similarity. Clustering draws on many fields, including mathematics, computer science, statistics, biology and economics. Many clustering techniques have been developed in different application fields; they are used to describe data, measure the similarity between different data sources, and classify data sources into different clusters.
(3) Anomaly detection
In data mining, anomaly detection identifies items, events, or observations that do not match an expected pattern or the other items in a dataset. Abnormal items often correspond to problems such as bank fraud, structural defects, medical problems or text errors. Anomalies are also known as outliers, novelties, noise, deviations, and exceptions.
There are three major types of anomaly detection methods. Unsupervised anomaly detection methods detect anomalies in unlabeled test data by finding the instances that fit the rest of the data least well, on the assumption that most of the instances in the dataset are normal. Supervised anomaly detection methods require a dataset that has been labeled "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherent class imbalance of anomaly detection). Semi-supervised anomaly detection methods build a model representing normal behavior from a given normal training dataset and then test the likelihood that a test instance was generated by the learned model.
The following describes a method, an apparatus and an electronic device for detecting user operation behavior data according to the present invention with reference to fig. 1 to fig. 8.
Fig. 1 is a schematic flow chart of a method for detecting user operation behavior data according to the present invention. As shown in fig. 1, the method for detecting user operation behavior data comprises the following steps:
Optionally, all user operation behavior data including the foreground system level may be collected based on a first database (system database), where the first database stores relational data and log data for recording various user operation behaviors.
The user operation behavior data includes, but is not limited to, data of one or more combinations of starting/ending time of various operations, specific steps of the operations, operation sequence and operation final result of the user.
Step 102, performing entity extraction on the user operation behavior data to obtain entity identification data, wherein the entity identification data is used for extracting data related to abnormal operation behaviors of the user.
Optionally, the entity extraction algorithm of the improved LSTM-CRF of the present invention may perform entity extraction on the collected user operation behavior data, and may extract data related to abnormal operation behavior, such as user operation behavior name, from a large amount of unstructured text data.
Because the user's foreground operations in the system produce a large amount of log data, with a large volume of historical data and little abnormal data, entity extraction needs to be carried out on the operation behaviors recorded in the log data.
Step 103, performing feature selection and feature dimension reduction on the entity identification data to obtain feature data subjected to dimension reduction, wherein the feature data is data for realizing feature extraction and data compression through feature selection and feature dimension reduction.
Optionally, for the problem of the excessively high dimensionality of the entity identification data, feature dimension reduction processing based on PCA (principal component analysis) may be used, so that the complexity of the prediction model may be reduced, the feature weight with lower importance on the model may be reduced, missing data may be eliminated, and the accuracy of subsequent modeling may be improved.
The data collected in the system has many features, so the "dimension disaster" (curse of dimensionality) problem may arise. It causes key factors and data to be submerged so that they cannot be mined, which leaves the prediction precision stuck at a bottleneck; in addition, high-dimensional, massive data makes the prediction model increasingly complex and slows down calculation. To address these problems, the invention adopts a method based on Principal Component Analysis (PCA) to perform dimension reduction on high-dimensional data, which improves the prediction precision, reduces the complexity of the prediction model, and realizes feature extraction and data compression.
Step 104, performing cluster analysis on the feature data to obtain classification data of various operation behaviors, wherein the classification data is used for classifying the various operation behaviors of the user.
Optionally, a K-means density clustering algorithm may be used to divide the user operation data set into objects belonging to different clusters, so that the operation behavior features within the same cluster are highly similar while the features of objects in different clusters differ greatly, until all points have been assigned to clusters. Cluster analysis not only makes it possible to demarcate and identify the sparse and dense areas of the operation data quickly and in real time, but also to find independent clusters, isolated points and the like in time, so that the inherent mathematical correlations hidden behind the data can be mined and analyzed.
Based on the data subjected to feature dimension reduction in step 103, cluster analysis is performed on the user operation data with an improved K-means algorithm. The improved algorithm simultaneously considers the sample density, the intra-cluster average distance and the inter-cluster distance, searches for the mathematical relationships and rules in the dynamic data, and mines the intelligence value of the dynamic operation data, so as to provide prediction and decision services for user operation compliance. Introducing operation data research based on a clustering algorithm reduces the cost of manually and randomly sampling users, improves the efficiency of mining user data, and improves the accuracy of anomaly analysis, thereby overcoming the scattered and local nature of traditional anomaly analysis; it is therefore a natural direction for the development of user behavior analysis.
Step 105, performing data analysis on the classified data by adopting an anomaly detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of abnormal operation behaviors of the user.
Optionally, the user operation behavior may be predicted and scored based on the weighted fusion of the score values of three anomaly detection algorithms, namely isolated forest (Isolation Forest), One-Class SVM and local anomaly factor (Local Outlier Factor, LOF), so that the clustering result of step 104 is further analyzed to comprehensively identify and evaluate the abnormal user operations most likely to affect the system.
The main task of anomaly detection analysis is to extract low-probability abnormal data points from a data set of normal users. The method integrates the three algorithms of isolated forest, One-Class SVM and local anomaly factor to comprehensively identify and evaluate the abnormal user operations most likely to affect the system. The three algorithms complete anomaly detection by weighted fusion: the anomaly score of each operation behavior is obtained from each algorithm separately, so that the prediction does not depend on a single anomaly detection algorithm, which can improve prediction accuracy and efficiency.
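By way of illustration only, the weighted fusion of the three anomaly scores may be sketched in Python as follows. The description does not specify how the scores are normalized, what the fusion weights are, or where the decision threshold lies, so the min-max normalization, the weights (0.4, 0.3, 0.3), the threshold 0.5 and the raw score values below are all assumptions made for this sketch.

```python
def min_max_normalize(scores):
    """Rescale raw scores to [0, 1] so detectors with different ranges can be mixed."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_anomaly_scores(score_lists, weights):
    """Weighted average of normalized per-detector scores (higher = more anomalous)."""
    if len(score_lists) != len(weights):
        raise ValueError("one weight is required per detector")
    total_weight = sum(weights)
    normalized = [min_max_normalize(scores) for scores in score_lists]
    count = len(normalized[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized)) / total_weight
        for i in range(count)
    ]


def flag_abnormal(fused_scores, threshold):
    """Mark an operation as abnormal when its fused score reaches the threshold."""
    return [score >= threshold for score in fused_scores]


# Hypothetical raw scores for four operations, one list per detector.
iforest_scores = [0.42, 0.45, 0.44, 0.81]  # assumed isolated forest scores
ocsvm_scores = [1.0, 1.2, 0.9, 3.5]        # assumed One-Class SVM scores
lof_scores = [1.01, 0.98, 1.05, 2.60]      # assumed local anomaly factor scores

fused = fuse_anomaly_scores(
    [iforest_scores, ocsvm_scores, lof_scores],
    weights=[0.4, 0.3, 0.3],  # assumed weights, not taken from the description
)
flags = flag_abnormal(fused, threshold=0.5)  # assumed threshold
```

In this sketch only the fourth operation is flagged, because all three detectors rank it highest; in practice the weights and the threshold would have to be tuned on labeled or reviewed data.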
In summary, in order to perform compliance analysis on user operation behaviors more accurately and perform prediction alarm on problems which may occur in the future, the present invention collects various data of user foreground operations, and combines entity identification technology, feature selection, feature dimension reduction, text clustering analysis and anomaly detection algorithm to model all user operation behavior data which need to be concerned, thereby effectively detecting abnormal data of abnormal user operation behaviors.
The above steps 101 to 105 will be described below by way of specific examples.
Fig. 2 is a schematic flow chart of the entity extraction step provided by the present invention, and fig. 3 is a schematic structural diagram of the entity extraction model provided by the present invention, as shown in fig. 2 and fig. 3. In the step 102, the extracting the entity from the user operation behavior data to obtain the entity identification data includes:
Since the system database holds a large number of log files in addition to relational data, and the log files record various operation information of the user, it is necessary to extract relevant operation behavior entity information from the log files. However, if the data is extracted by manual screening or labeling, a large amount of labor cost is consumed, and the accuracy cannot be guaranteed.
Therefore, the invention uses an entity recognition method from natural language processing, combined with a deep learning algorithm, to extract entity information from the log files stored in the system database. Specifically, part of the training data is first labeled and an entity extraction model is trained with a neural network, so that the neural network learns the syntactic and lexical features of the log files; the trained model is then used to predict on more data.
Text data in user operation logs often has variable length and carries many irrelevant network words, and these special conditions greatly degrade the recognition performance of a traditional entity extraction model. The invention therefore optimizes and adjusts the traditional NER (Named Entity Recognition) model, as detailed below:
(1) first layer of entity extraction model
The first layer of the entity extraction model is a word embedding layer and is used for training an input word sequence into a word vector to be output.
Specifically, the invention adopts the CBOW (Continuous Bag-of-Words) model in Word2Vec (word to vector, a family of models for generating word vectors) to train the word vectors. The CBOW model determines the position of each word in the vector space through context analysis, and the word vector of each word is output and used as the input of one time step of the next-layer neural network.
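As a minimal sketch of what the CBOW model is trained on, the following Python fragment builds (context, target) pairs from a tokenized log line: each word is the prediction target and the surrounding words within a window are its context. The window size and the example tokens are assumptions for illustration; the training of the embedding itself is not shown.

```python
def cbow_pairs(tokens, window=2):
    """Build CBOW training pairs: the context words are used to predict the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


# Hypothetical tokenized log entry.
log_tokens = ["user", "opens", "account", "page"]
pairs = cbow_pairs(log_tokens, window=1)
```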
(2) Second layer of entity extraction model
The second layer of the entity extraction model inputs the word vectors output by the first layer into a BiLSTM (Bi-directional Long Short-Term Memory) layer for training, so as to learn the relation between words and output labels. The BiLSTM layer comprises a forward LSTM (Long Short-Term Memory) network and a reverse LSTM network, which are connected through an output layer. The bidirectional LSTM network obtains the corresponding hidden output sequences through the forward LSTM and the reverse LSTM; these are concatenated to form a complete hidden sequence at each time step, which serves as the input of the next layer. The matrix formed by the hidden states generated by the BiLSTM layer is $H = \{h_1, h_2, \ldots, h_j\}$.
The following describes the algorithmic improvement of the BiLSTM layer (as shown in fig. 3):
The traditional NER model uses a unidirectional LSTM structure, which can only record the input before time step t and cannot acquire information from future time steps. When the text is short, the model must make full use of the limited feature information and take the context into account in order to capture features effectively.
The bidirectional LSTM structure (BiLSTM) can effectively solve this problem. The BiLSTM is composed of two unidirectional LSTMs running in opposite directions, and the two network structures are connected by an output layer. The forward LSTM feeds data into the neural network through the input layer and obtains a result on the output layer by the normal calculation and propagation. The reverse LSTM processes the same sequence in the opposite order, so that its hidden state at each time step summarizes the following context; during training, errors are propagated back layer by layer toward the input layer, and the network parameters of each layer are updated according to the errors. The bidirectional LSTM model considers the sequence information of past and future moments at the same time and thus records both the past and the future information of each time step, so that the prediction can be relatively accurate even when the text is short.
(3) Third layer of the entity extraction model
The third layer of the entity extraction model adds an attention mechanism (attention model) on the output sequence of the BiLSTM layer. The attention mechanism is used for the labeling problem: it lets the entity extraction model focus better on local features and highlights the important role of keywords. Different weights are assigned to the outputs of the BiLSTM layer, and a new output vector is obtained by summing the products of the feature vectors and their corresponding weights.
For the model output vector at time i, the model uses the attention weight distribution vector to compute a weighted sum of the hidden-layer outputs of the encoded source sequence, which gives the encoding result of the source sequence for the current output. The formula is as follows:
$$c_i = \sum_{j=1}^{L_x} a_{ij} h_j$$
wherein $c_i$ is the new feature vector output by the attention mechanism, that is, the weighted sum of each feature vector $h_j$ output by the preceding model and its corresponding weight $a_{ij}$, and $L_x$ is the length of the input sequence. In the attention layer, the outputs at all time steps are thus multiplied by their corresponding weights and added to form the final output. The weight $a_{ij}$ is calculated from the feature vector $c_{i-1}$ of the previous time step and $h_j$ by the following two equations:
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{L_x} \exp(e_{ik})}$$
$$e_{ij} = v_a \tanh(w_a c_{i-1} + w_b h_j)$$
wherein $v_a$, $w_a$ and $w_b$ are weights.
The attention coefficient $a_{ij}$, also called the perceptron, measures the relationship between the hidden state $h_j$ generated by the BiLSTM and position i of the output tag. The hidden layer contains not only the global information of the text but also its local keyword information, and the output state of the current time step is obtained by weighted summation. A linear transformation is then applied so that the result matches the label dimensionality, and the final output vector is obtained through softmax (which converts the output of the neural network into a probability distribution). In exchange for higher accuracy, the attention model used here is an additive model.
The attention model introduced by the invention is widely applicable across deep learning fields, and helps the NER model focus on local features and grasp the key points of a text within a very short span. The attention model focuses on the words near the tagged word and, as appropriate, ignores more distant or irrelevant word information. The probability distribution values represent the attention value that the attention model assigns to each word, and thus indicate the regions on which the attention model is focused.
Combined with the BiLSTM, the transformation function for the intermediate semantics of the whole sentence is as follows:
$$C_i = \sum_{j=1}^{L_x} a_{ij} h_j$$
The current state $C_i$ of the attention model is jointly determined by the length $L_x$ of the input sentence, the attention coefficient $a_{ij}$ and the state value $h_j$ of the j-th word. The update of the attention model is determined by the attention coefficient: the more attention the output item gives to an input item, the larger the corresponding value of $a_{ij}$.
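The additive attention described above may be sketched in plain Python as follows, for a single output position i. The sketch assumes that the energy is e_ij = v_a tanh(w_a c_{i-1} + w_b h_j), that the coefficients a_ij are obtained by softmax over j, and that the context is the weighted sum of the hidden states; the identity matrices used for w_a and w_b and the example hidden states are placeholder values, not trained parameters.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def additive_attention(prev_context, hidden_states, w_a, w_b, v_a):
    """Return (a_ij for all j, context vector c_i) for one output position i."""
    projected_context = matvec(w_a, prev_context)
    energies = []
    for h_j in hidden_states:
        projected_hidden = matvec(w_b, h_j)
        activated = [math.tanh(a + b) for a, b in zip(projected_context, projected_hidden)]
        energies.append(sum(v * t for v, t in zip(v_a, activated)))  # e_ij
    # Softmax over the energies (shifted by the maximum for numerical stability).
    peak = max(energies)
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    weights = [x / total for x in exps]  # a_ij
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context


# Placeholder parameters and BiLSTM hidden states (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention(
    prev_context=[0.0, 0.0],
    hidden_states=hidden_states,
    w_a=identity,
    w_b=identity,
    v_a=[1.0, 1.0],
)
```

With these placeholder values the third hidden state receives the largest coefficient, and the coefficients sum to one as required of a probability distribution.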
(4) Fourth layer of the entity extraction model
The fourth layer of the entity extraction model is a CRF (conditional random field) layer applied after the attention mechanism. It outputs the transition scores between labels through a transition matrix and obtains the optimal label sequence based on the transition rules of the labels and the validity of the label grammar.
Specifically, with the CRF applied after the attention mechanism, Viterbi decoding can be used to obtain the best tag sequence, which is output as the optimal solution.
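A minimal Viterbi decoding sketch in Python is given below. It assumes that the preceding layers supply a per-step emission score for each label and that the CRF layer supplies a label-to-label transition matrix; the three labels (O, B, I) and all score values are hypothetical and chosen only so that the forbidden O-to-I transition is visible.

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label path.

    emissions[t][k]   : score of label k at time step t
    transitions[a][b] : score of moving from label a to label b
    """
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for cur in range(num_labels):
            best_prev = max(
                range(num_labels),
                key=lambda prev: scores[prev] + transitions[prev][cur],
            )
            new_scores.append(scores[best_prev] + transitions[best_prev][cur] + step[cur])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best_last = max(range(num_labels), key=lambda k: scores[k])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]


labels = ["O", "B", "I"]  # hypothetical tag set
emissions = [
    [1.0, 2.0, 0.0],
    [0.5, 0.0, 2.0],
    [2.0, 0.0, 0.5],
]
transitions = [
    [0.5, 0.0, -1000.0],  # from O: moving to I is effectively forbidden
    [0.0, -1.0, 1.0],     # from B
    [0.0, 0.0, 0.5],      # from I
]
path, score = viterbi_decode(emissions, transitions)
tags = [labels[k] for k in path]
```

For these scores the decoded sequence is B, I, O with a total score of 7.0, which illustrates how the transition matrix rules out label sequences that violate the label grammar.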
Thus, whereas the prior art adopts the LSTM-CRF entity extraction algorithm, the invention adopts a BiLSTM-CRF entity extraction algorithm improved from LSTM-CRF, in which the BiLSTM consists of a bidirectional LSTM network structure. CRF is a common sequence labeling algorithm that can be used for tasks such as part-of-speech tagging, word segmentation and named entity recognition. The BiLSTM + CRF adopted by the invention combines the two, so that the model can take into account the dependencies between preceding and following elements of the sequence, as a CRF does, while also having the feature extraction and fitting capability of the LSTM.
In summary, the invention applies the entity extraction technology in the natural language processing field to the user operation data collection, and correspondingly improves the named entity recognition model aiming at the particularity of the user log text, and on the basis of the LSTM-CRF named entity recognition model in the prior art, the unidirectional LSTM is changed into the bidirectional LSTM, and the attention model is added. The improved named entity recognition model is applied to user behavior analysis work oriented to the data field, named entity recognition is carried out on texts which are mainly concerned in operation steps, an anomaly detection system is helped to efficiently mine valuable information, and a good effect is achieved on feature capture of mass log information.
FIG. 4 is a schematic flow chart of the feature processing steps provided by the present invention. As shown in fig. 4, in step 103, performing feature selection and feature dimension reduction on the entity identification data to obtain feature data after dimension reduction includes:
When the entity identification data extracted in step 102 is combined with the discrete data stored in the second database, the combined data may suffer from the "dimension disaster" problem. On the one hand, the dimension disaster causes key factors and data to be submerged so that they cannot be mined, and the prediction accuracy is stuck at a bottleneck that is difficult to overcome. On the other hand, high-dimensional, massive data makes the prediction model increasingly complex and the calculation increasingly slow, so that computing capacity has to be expanded continuously and is wasted. Therefore, in order to keep improving the prediction precision and to reduce the complexity of the prediction model, dimension reduction of the high-dimensional data is necessary when the feature vector set is constructed. The invention adopts feature selection and feature dimension reduction based on Principal Component Analysis (PCA) to realize feature extraction and data compression. The specific steps are as follows:
Specifically, the two types of data, namely the processed entity identification data stored in the system database and the data stored in the second database (i.e., the data on transacting user services), are loaded and aggregated together.
Specifically, abnormal data is processed first: for example, records whose performance data exceeds the normal-range threshold are treated as abnormal values and removed by direct deletion. Duplicates in the data are then processed; duplicate values may be caused by repeated starting of the platform program or by a problem in the warehousing stage. A merge method may be employed, in which records are compared attribute by attribute and records whose attribute values are all equal are merged into one record.
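The cleaning described above (direct deletion of out-of-range records, followed by merging records whose attribute values are all equal) may be sketched in Python as follows. The record fields, the `duration_ms` attribute and the normal range of 0 to 60000 are hypothetical values chosen for illustration.

```python
def clean_records(records, value_key, low, high):
    """Drop records whose value lies outside [low, high], then merge exact duplicates."""
    in_range = [r for r in records if low <= r[value_key] <= high]
    seen = set()
    merged = []
    for record in in_range:
        # Two records are equal when all of their attribute values are equal.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            merged.append(record)
    return merged


# Hypothetical operation records.
records = [
    {"user": "u1", "operation": "query", "duration_ms": 120},
    {"user": "u1", "operation": "query", "duration_ms": 120},      # duplicate
    {"user": "u2", "operation": "export", "duration_ms": 999999},  # out of range
    {"user": "u3", "operation": "update", "duration_ms": 340},
]
cleaned = clean_records(records, value_key="duration_ms", low=0, high=60000)
```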
Step 403, performing feature selection on the processed data, and storing the feature selection data obtained after selection and filtering.
In machine learning, feature selection generally serves two purposes: firstly, the number of features is reduced, and the training speed is improved; second, noise characteristics are reduced to improve the accuracy of the model over the test set. There are many common feature selection algorithms, such as chi-squared test and mutual information.
Specifically, for discrete data a selection result is obtained through discrete calculation methods, mainly the chi-square test and mutual information; for continuous data a selection result is obtained through continuous calculation methods, mainly the Pearson correlation coefficient and the Fisher score. The feature data obtained after selection and filtering is stored to support further data analysis.
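For continuous data, filtering by the Pearson correlation coefficient may be sketched in Python as follows. The feature names, the target values and the cut-off of 0.5 on the absolute correlation are assumptions for this sketch; the chi-square test, mutual information and Fisher score mentioned above would be applied in an analogous filter-style manner and are not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, min_abs_corr):
    """Keep the features whose absolute correlation with the target reaches the cut-off."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= min_abs_corr]
    return kept, scores


# Hypothetical continuous features and target.
columns = {
    "ops_per_minute": [1, 2, 3, 4, 5],
    "session_id_mod": [3, 1, 4, 1, 5],
    "constant_flag": [1, 1, 1, 1, 1],
}
target = [2, 4, 6, 8, 10]
kept, scores = select_features(columns, target, min_abs_corr=0.5)
```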
Step 404, calculating a covariance matrix representing the data correlation based on the feature selection data, and performing eigendecomposition on the covariance matrix to obtain the eigenvalues and the set of eigenvectors.
Step 405, projecting the feature selection data onto the feature matrix formed from the eigenvalues and the eigenvector set to obtain feature data after dimension reduction, and storing the feature data.
Specifically, the data dimension reduction can be realized through a Principal Component Analysis (PCA) algorithm, and the stored feature data after dimension reduction can be used as a data base of a deep learning prediction system and a big data analysis processing system.
The Principal Component Analysis (PCA) algorithm described above is as follows:
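Principal component analysis is performed on the operation behavior data of the user foreground to obtain principal components of reduced dimensionality. All operation behavior data is arranged into a sample matrix of size m × k, in which row i is the sample $x_i$:
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mk} \end{pmatrix}$$
The sample matrix is centered:
$$x_i \leftarrow x_i - \frac{1}{m}\sum_{l=1}^{m} x_l$$
The covariance of the feature data set is calculated:
$$\frac{1}{m}\sum_{i=1}^{m} x_i x_i^{T} = \frac{1}{m} X^{T} X$$
wherein X represents the collection of the feature data $x_i$.
The eigenvalues of the covariance matrix are calculated, the eigenvectors corresponding to the d largest eigenvalues are taken, and the projection matrix is output. Assume that the transformed coordinate system is $\{w_1, w_2, \ldots, w_d\}$, where the $w_j$ are orthonormal basis vectors forming the columns of $W$. After dimension reduction, the projection of the feature data $x_i$ in the lower-dimensional coordinate system is $z_i = (z_{i1}, z_{i2}, \ldots, z_{id})$ with $z_{ij} = w_j^{T} x_i$. If $x_i$ is reconstructed from $z_i$, the distance between the reconstruction and the original data is:
$$\sum_{i=1}^{m} \Big\| \sum_{j=1}^{d} z_{ij} w_j - x_i \Big\|_2^2 = \sum_{i=1}^{m} z_i^{T} z_i - 2\sum_{i=1}^{m} z_i^{T} W^{T} x_i + \mathrm{const} \propto -\mathrm{tr}\Big( W^{T} \Big( \sum_{i=1}^{m} x_i x_i^{T} \Big) W \Big)$$
where const is a constant and can be ignored.
To achieve the dimension reduction effect, the above expression should be minimized. Since $\sum_{i=1}^{m} x_i x_i^{T} = X^{T} X$ represents the covariance matrix (up to the factor 1/m), the optimization problem is:
$$\min_{W} \; -\mathrm{tr}\big( W^{T} X^{T} X W \big) \quad \text{s.t.} \quad W^{T} W = I$$
The principal components after PCA dimension reduction are obtained by solving this constrained problem.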
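A dependency-free Python sketch of these PCA steps (centering, covariance, extraction of the leading eigenvectors, projection) is given below. As an assumption of this sketch, the eigenvectors are obtained by power iteration with deflation rather than by a full eigendecomposition, which is adequate for illustration but would normally be replaced by a numerical library; the fixed start vector can fail in the degenerate case where it is orthogonal to the leading eigenvector. The sample data is hypothetical.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every sample."""
    m, k = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(k)]
    return [[row[j] - means[j] for j in range(k)] for row in rows]


def covariance(centered):
    """Covariance matrix (1/m) * sum_i x_i x_i^T of centered samples."""
    m, k = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / m for b in range(k)]
            for a in range(k)]


def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector by power iteration."""
    k = len(matrix)
    vec = [float(j + 1) for j in range(k)]  # fixed, non-symmetric start vector
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    value = sum(vec[a] * matrix[a][b] * vec[b] for a in range(k) for b in range(k))
    return value, vec


def pca(rows, d):
    """Return the d leading components and the samples projected onto them."""
    centered = center(rows)
    cov = covariance(centered)
    k = len(cov)
    components = []
    for _ in range(d):
        value, vec = dominant_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the component just found before searching for the next one.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(k)] for a in range(k)]
    reduced = [[sum(x * w for x, w in zip(row, comp)) for comp in components]
               for row in centered]
    return components, reduced


# Hypothetical samples lying on the line y = 2x, reduced from 2 dimensions to 1.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
components, reduced = pca(samples, d=1)
```

Because the samples lie exactly on one line, a single component in the direction (1, 2) retains all of the variance, which is the situation in which dimension reduction loses no information.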
In summary, the collected user operation data set has too high a dimensionality to construct an effective data model, and at the data presentation layer a large amount of high-dimensional data causes the computational complexity of data processing algorithms to grow exponentially, and may even cause dimension explosion, which seriously affects the operating efficiency of the system. PCA dimension reduction is a feature processing and data compression method that reduces the data dimension while retaining as much of the main information of the original data as possible. It preserves enough information to distinguish different categories, stores data information effectively, reduces data complexity, and leaves room for the data set to grow.
Furthermore, the steps 101 to 103 can be realized by combining modules with different functions. For example, the following functional modules are set by the system:
The core database is used for storing the various data acquired by the platform and providing a data basis for the other modules. The data preprocessing module is used for filling missing values, removing redundant data, encoding non-numerical data and similar processing of the original data, and for performing normalization and centering, so as to unify the data structure and facilitate subsequent calculation. The data dimension reduction and compression module is used for reducing the dimension of the data with Principal Component Analysis (PCA), reducing the data volume and providing data support for a deep learning prediction model. The data feature extraction module performs feature extraction with criteria appropriate to each data type, extracts key data information and provides a data basis for big data analysis and processing.
The functional modules are only examples of the invention for implementing the steps 101 to 103, and the invention is not limited to the functional modules.
FIG. 5 is a schematic flow chart of the cluster analysis steps provided by the present invention. As shown in fig. 5, in step 104, performing cluster analysis on the feature data to obtain classification information of various operation behaviors includes:
Optionally, the K-means density clustering algorithm is to preset a threshold before clustering, calculate a weight based on the density of the feature data, an intra-cluster average distance, and an inter-cluster distance, calculate a distance of the feature data by using a weighted euclidean distance, and select an initial clustering center by using the calculated density, weight, and distance of the feature data, so as to obtain an initial input parameter of the K-means density clustering algorithm.
Step 502, analyzing the data based on the density of the feature data distribution to obtain classified data of various operation behaviors.
Taking the feature data after feature selection as the object of study, the user behavior operation data is analyzed and mined through the K-means density clustering algorithm, and the user operations are divided into a plurality of clusters. Since most operations comply with the specification, cluster analysis is used to find the set of clusters of compliant operations; data of illegal operations is usually distributed outside these clusters, so illegal operation behaviors can be discovered automatically through clustering.
The K-means density clustering algorithm is described in detail as follows:
The basic idea of the classic K-means clustering algorithm is as follows: after the number of clusters k is input, k sample points are first randomly selected from the data set as the initial cluster centers; the distance from each sample point to each of the k initial cluster centers is then calculated, and the samples are assigned according to the minimum-distance principle to form k clusters; the mean of each cluster is then calculated to obtain the new cluster centers. This process is repeated until the cluster centers no longer change or the number of iterations reaches a set value, at which point the algorithm ends.
The K-means algorithm may use the Euclidean distance when calculating the distance between samples, which is calculated as follows:
$$d(x_i, x_j) = \sqrt{\sum_{p=1}^{m} (x_{ip} - x_{jp})^2}$$
wherein $x_i = \{x_{i1}, x_{i2}, \ldots, x_{im}\}$ and $x_j = \{x_{j1}, x_{j2}, \ldots, x_{jm}\}$ are any two sample points of dimension m, and $x_{ip}$ denotes the value of the p-th dimension of sample i.
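The classic K-means procedure with the Euclidean distance may be sketched in Python as follows. The random seed, the iteration limit and the six sample points are assumptions of this sketch; the initial centers are drawn at random, as in the classic algorithm.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0):
    """Classic K-means: random initial centers, nearest-center assignment, mean update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments, and therefore centers, no longer change
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Hypothetical two-dimensional feature data forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assignment = kmeans(points, k=2)
```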
The invention improves the classic K-means algorithm as follows:
The classic K-means clustering algorithm has certain limitations. Because its initial cluster centers are set randomly, the clustering result is unstable and prone to falling into a local optimum, and the result is easily affected by noise points; in addition, the value of K must be preset by the user before clustering, so the algorithm adapts poorly. To address these problems, the invention provides a K-means algorithm improved on the basis of distance and weight. The weight calculation combines the sample density, the intra-cluster average distance and the inter-cluster distance, and the sample distance is calculated as a weighted Euclidean distance, which increases the discrimination between data attributes and reduces the influence of abnormal points. The initial cluster centers are then selected using the calculated sample density, sample weight and distance, which yields the initial input parameters of the K-means clustering algorithm.
The method comprises the following specific steps:
step 1: for a given data set D, the density of all samples in the data set and the weight w of all sample elements in the data set D are calculated. The first initial cluster center selects the object c with the highest density in D 1 Adding it to the set C of cluster center points, where C ═ C 1 Then all distance points c in D 1 Dot deletions less than MeanDist (D).
Density calculation formula of the sample:
formula for calculating the weight w of all sample elements:
MeanDist (D) calculation formula:
step 2: is selected to have the maximum τ i =ω i ·d ω (x i ,c 1 ) Point x of value i As the 2 nd initial cluster center, is denoted as c 2 C is mixing 2 Added to the set C, when C ═ C 1 ,c 2 Similar to the first step, all distances c in D are divided 2 Dot deletions less than MeanDist (D).
And step 3: selecting the one with the maximum τ i =ω i` ·d ω (x i` ,c 2 ) Point x of value i` As the 3 rd initial cluster center, is denoted as c 3 C is mixing 3 Added to the set C, when C ═ C 1 ,c 2 ,c 3 All the distances c in D 3 Point deletions less than meandist (D), and similarly repeating the above process continuously until data set D becomes an empty set. When C is { C ═ C 1 ,c 2 ,…,c k Thus, k initial cluster centers are obtained, i.e. the sample points in the set C.
And 4, step 4: and taking the initial clustering center and the clustering number obtained in the previous steps as input, and carrying out K-means clustering operation on the given data set D until the clustering center is not changed any more.
And 5: and outputting a final clustering result.
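Steps 1 to 3 above may be sketched in Python as follows. Because the exact density, weight and MeanDist formulas are not reproduced in this text, the sketch makes the following assumptions, which should not be read as the formulas of the invention: MeanDist(D) is taken as the average pairwise distance, the density of a sample is the number of other samples closer than MeanDist(D), the weight of a sample is set equal to its density, and the plain Euclidean distance stands in for the weighted Euclidean distance. The data set must contain at least two points.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean_dist(points):
    """Assumed MeanDist(D): average distance over all pairs of samples."""
    pairs = list(combinations(points, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def density(index, points, radius):
    """Assumed density: number of other samples closer than `radius`."""
    return sum(
        1 for j, other in enumerate(points)
        if j != index and euclidean(points[index], other) < radius
    )


def select_initial_centers(points):
    """Pick initial centers until every sample lies within MeanDist of some center."""
    radius = mean_dist(points)
    weight = [density(i, points, radius) for i in range(len(points))]  # assumed weight
    remaining = set(range(len(points)))

    # Step 1: the densest sample becomes the first center.
    chosen = max(sorted(remaining), key=lambda i: weight[i])
    centers = [chosen]
    while True:
        # Delete the chosen sample and every sample closer than MeanDist(D) to it.
        remaining = {
            i for i in remaining
            if i != chosen and euclidean(points[i], points[chosen]) >= radius
        }
        if not remaining:
            break
        # Steps 2 and 3: maximize tau = weight * distance to the previous center.
        previous = chosen
        chosen = max(
            sorted(remaining),
            key=lambda i: weight[i] * euclidean(points[i], points[previous]),
        )
        centers.append(chosen)
    return [points[i] for i in centers]


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
initial_centers = select_initial_centers(points)
k = len(initial_centers)  # the number of clusters falls out of the selection
```

The resulting centers and their count would then be passed to the K-means iteration of step 4 in place of random initial centers, which is how the improved algorithm avoids presetting K.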
In conclusion, user operation normative analysis based on the density clustering algorithm can intelligently mine the rules of user operation behavior, and solves the problems that manual review is costly and that the accuracy and timeliness of manual prediction cannot be guaranteed. The improved K-means algorithm eliminates the influence of isolated points, effectively overcomes the poor noise immunity of the classic K-means algorithm and its tendency to fall into local optima, and improves the stability of the algorithm.
FIG. 6 is a schematic flow chart of the anomaly scoring step provided by the present invention. As shown in fig. 6, in step 105, analyzing the classified data based on the anomaly detection algorithm to obtain normal data of the normal operation behavior of the user and abnormal data of the illegal operation behavior of the user includes:
Specifically, through the cluster analysis in step 104, the various operation behaviors of the user can be summarized and the operation rules and logic in them can be mined. The invention analyzes the clustering result of step 104 in more depth and uses the data to detect whether a user operation is abnormal. The main task of anomaly detection analysis is to extract low-probability abnormal data points from a data set of normal users; these abnormal points are not produced by random deviation but by entirely different mechanisms such as failure, threat and intrusion. Compared with the large number of normal events, such abnormal events account for only a very small fraction. There are many anomaly detection algorithms; although they all aim to separate normal data from abnormal data as far as possible, their principles differ. The invention adopts three algorithms, namely isolated forest, One-Class SVM and local anomaly factor, to complete the anomaly detection task.
The three algorithms, namely the isolation forest, the One-Class SVM and the local outlier factor, are described below.
(1) Isolation forest
The isolation forest algorithm is an anomaly detection algorithm based on partitioning and ensemble learning. Its design exploits two characteristics of abnormal data: first, abnormal data is very scarce compared with normal data; second, the attribute values of abnormal data differ markedly from those of normal data. The core of the algorithm is to randomly sample the data and construct a certain number of isolation trees (iTree), which together form an isolation forest (iForest). An isolation forest is constructed in the following steps:
Step 1: Randomly select m sample data points from a training set of continuous data as a sub-sampling set $D = \{d_1, d_2, \ldots, d_m\}$, where each data point has dimension n, and place the set at the root node of the tree.
Step 2: Randomly select a dimension A and a split point p from the current sub-sampling set, where p lies between the minimum and maximum values of dimension A in the current sub-sampling set.
Step 3: Partition each data point $d_i$ of the sub-sampling set by its value $d_i(A)$ in dimension A: if $d_i(A) < p$, it is assigned to the left subtree; otherwise, it is assigned to the right subtree.
Step 4: Repeat steps 2 and 3 to keep constructing new left and right subtrees until one of the following conditions is met: only one data point, or several identical data points, remain in D, so that no further split is possible; or the isolation tree reaches the height limit.
Step 5: Repeat the above steps until the number of isolation trees reaches the specified number N; these isolation trees form the isolation forest.
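The construction steps above can be sketched in plain Python. Several details are assumptions rather than part of the description: the height limit of ceil(log2 m) is a common choice, the split dimension is drawn only from dimensions whose values differ, and scoring a point by its mean depth (shorter paths suggest anomalies) is a simplified stand-in for the usual normalized isolation forest score. All function names are hypothetical.

```python
import math
import random


def build_itree(data, height, height_limit, rng):
    """Steps 2-4: recursively split on a random dimension A and split point p."""
    # Stop when the node cannot be split further or the height limit is reached.
    if height >= height_limit or all(d == data[0] for d in data):
        return {"size": len(data)}
    # Assumption: pick A only among dimensions whose values differ.
    dims = [a for a in range(len(data[0])) if len({d[a] for d in data}) > 1]
    a = rng.choice(dims)
    p = rng.uniform(min(d[a] for d in data), max(d[a] for d in data))
    left = [d for d in data if d[a] < p]
    right = [d for d in data if d[a] >= p]
    return {
        "dim": a,
        "split": p,
        "left": build_itree(left, height + 1, height_limit, rng),
        "right": build_itree(right, height + 1, height_limit, rng),
    }


def build_iforest(train, m, n_trees, seed=0):
    """Steps 1 and 5: build n_trees isolation trees, each on m sampled points."""
    rng = random.Random(seed)
    height_limit = max(1, math.ceil(math.log2(m)))  # assumed height limit
    return [
        build_itree(rng.sample(train, m), 0, height_limit, rng)
        for _ in range(n_trees)
    ]


def path_length(tree, x, depth=0):
    """Number of splits needed to route x to a leaf."""
    if "size" in tree:
        return depth
    child = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)


def mean_depth(forest, x):
    """Simplified score: average depth over all trees (lower is more isolated)."""
    return sum(path_length(t, x) for t in forest) / len(forest)


# Hypothetical data: an 8 x 8 grid of normal points plus one distant point.
train = [(float(x), float(y)) for x in range(8) for y in range(8)] + [(100.0, 100.0)]
forest = build_iforest(train, m=32, n_trees=100, seed=0)
outlier_depth = mean_depth(forest, (100.0, 100.0))
inlier_depth = mean_depth(forest, (3.0, 4.0))
```

In this sketch the distant point is expected to reach a leaf after far fewer splits than a point inside the grid, which reflects the two characteristics of abnormal data that the algorithm relies on.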
(2) One-Class SVM
The One-Class SVM treats the one-class classification problem as a special two-class problem: it converts the classical SVM problem of finding the separating hyperplane with the maximum margin in feature space into the problem of maximizing the margin between the hyperplane and the origin. The optimization problem becomes:
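A sketch of this problem, assuming the standard ν-formulation of the One-Class SVM and writing $\Phi(x_i)$ (a symbol not defined in this text) for the feature-space image of sample $x_i$:

```latex
\min_{\omega,\,\xi,\,\rho}\;
  \frac{1}{2}\lVert \omega \rVert^{2}
  + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i
  - \rho
\qquad \text{s.t.}\quad
  \omega \cdot \Phi(x_i) \ge \rho - \xi_i,\quad
  \xi_i \ge 0,\quad i = 1,\ldots,l
```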
In the formula, $\omega$ is the normal vector of the hyperplane, $i$ is the sample index, $\xi_i$ is the slack variable, $\rho$ is the hyperplane intercept, and $\nu \in (0, 1]$ controls the upper bound on the fraction of boundary support vectors and the lower bound on the fraction of all support vectors, and is used to preset the proportion of negative samples; $l$ is the total number of samples, and $\nu l$ determines the penalty coefficient. Training the One-Class SVM requires only positive samples, which helps ensure a high anomaly recognition rate. The algorithm is therefore mainly used to estimate the distribution of high-dimensional data and is suited to machine learning problems such as training-sample screening and anomaly detection when the numbers of positive and negative training samples are unbalanced.
(3) Local outlier factor (LOF)
The LOF algorithm determines whether each point p is an outlier from the density of p and of its neighborhood: the lower the density of p, the more likely p is an outlier. For any point p taken from the point cloud after threshold processing, the k-th distance $d_k(p)$ of p is defined as:
$d_k(p) = d(p, o)$;
where $d(p, o)$ is the distance between point p and point o, and o is the k-th nearest point to p.
Given $d_k(p)$, the k-th distance neighborhood of p is defined as all points whose distance from p is not greater than $d_k(p)$, i.e.
$N_k(p) = \{q \in D \setminus \{p\} \mid d(p, q) \le d_k(p)\}$;
In the formula, $N_k(p)$ is the k-th distance neighborhood of point p; q is a neighborhood point of p; and $D \setminus \{p\}$ denotes the set of points in the point cloud other than p.
The k-th reachable distance from point p to point o is:
$d_r(p, o) = \max\{d_k(o), d(p, o)\}$;
The expression above means that the reachable distances from the k points closest to point o to point o are all equal to $d_k(o)$.
According to the above definition, the local reachable density of a point p is expressed as:
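Assuming the standard LOF definition, the local reachable density is the inverse of the mean reachable distance from p to the points in its neighborhood. The symbol $\mathrm{lrd}_k$ is introduced here for illustration:

```latex
\mathrm{lrd}_k(p) =
  \left( \frac{1}{\lvert N_k(p) \rvert}
         \sum_{o \in N_k(p)} d_r(p, o) \right)^{-1}
```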
By comparing the local reachable density of point p with that of point o (a neighborhood point of p), a comparison factor, namely the local outlier factor, is constructed as follows to detect outliers:
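Assuming the standard LOF definition, and writing $\mathrm{lrd}_k$ (a symbol introduced here for illustration) for the local reachable density of a point, the factor is the average ratio of the neighbors' density to the density of p:

```latex
\mathrm{LOF}_k(p) =
  \frac{1}{\lvert N_k(p) \rvert}
  \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}
```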
The closer the ratio is to 1, the smaller the difference between the density of point p and the density of its neighborhood points, and the more likely p belongs to the same cluster as its neighborhood. The further the ratio falls below 1, the higher the density of p relative to its neighborhood points, and p is a dense point. The further the ratio rises above 1, the lower the density of p relative to its neighborhood points, and the more likely p is an outlier. Therefore, an appropriate threshold is selected for the LOF value, and the points within the threshold range are retained, yielding the target point cloud with the outliers removed.
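The LOF definitions can be traced with a brute-force sketch. It assumes Euclidean distance, the standard form of the local reachable density (the inverse of the mean reachable distance), and data without exact duplicates; `lof_scores` is a hypothetical name.

```python
import math


def lof_scores(points, k):
    """Return the local outlier factor of every point (O(n^2) brute force)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k-th distance d_k(p): distance from p to its k-th nearest other point.
    kdist = [sorted(dist[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]

    # k-th distance neighborhood N_k(p): all other points within d_k(p).
    neigh = [
        [j for j in range(n) if j != i and dist[i][j] <= kdist[i]]
        for i in range(n)
    ]

    def reach(i, j):
        # Reachable distance d_r(p_i, o_j) = max{d_k(o_j), d(p_i, o_j)}.
        return max(kdist[j], dist[i][j])

    # Local reachable density: inverse of the mean reachable distance.
    # Assumption: no exact duplicates, so the sum is never zero.
    lrd = [len(neigh[i]) / sum(reach(i, j) for j in neigh[i]) for i in range(n)]

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: four corners of a unit square and one distant point.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(cloud, k=2)
```

In this example the four square corners have a ratio of 1 because each shares the density of its neighbors, while the distant point has a ratio far above 1 and would be removed by any reasonable threshold.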
Specifically, for different data sources it is difficult to know in advance which anomaly detection algorithm will give the best result. The three algorithms, namely the isolation forest, the One-Class SVM and the local outlier factor, are therefore integrated to comprehensively identify and evaluate the abnormal users most likely to affect the system. The invention performs anomaly detection with each of the three algorithms and obtains an anomaly score for every user from each of them. The results of the three algorithms are then weighted and normalized to obtain the final anomaly score ranking for all users.
Each algorithm calculates a separate anomaly score for user i. The scores of the three algorithms (isolation forest, One-Class SVM and local outlier factor) are denoted $S_1$, $S_2$, $S_3$, and the corresponding weights are $P_1$, $P_2$, $P_3$; the final anomaly Score is then:
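A plausible form of this weighted combination is a weighted sum. It is assumed here that the three scores have first been normalized to a common scale and that the weights sum to 1; neither condition is stated explicitly in this text:

```latex
\mathrm{Score} = P_1 S_1 + P_2 S_2 + P_3 S_3,
\qquad P_1 + P_2 + P_3 = 1
```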
Ranking by the final anomaly Score therefore makes it possible to comprehensively identify and evaluate the abnormal user operations most likely to affect the system.
In summary, user behavior analysis based on anomaly detection predicts the user operation compliance score through a weighted fusion of three anomaly detection algorithms. This integrated approach comprehensively identifies and evaluates the abnormal users most likely to affect the system and separates normal data from abnormal data with higher accuracy, which supports the accuracy of anomaly detection.
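The weighted fusion and ranking can be sketched as follows. The sketch assumes that each detector's raw scores are min-max normalized to [0, 1], that a higher score always means "more anomalous" (detectors with the opposite sign convention would need to be negated first), and that the weights are divided by their sum. The names `min_max`, `fuse` and `rank_users` and all sample values are hypothetical.

```python
def min_max(scores):
    """Rescale one detector's scores to [0, 1] so detectors are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(score_lists, weights):
    """Weighted, normalized combination of per-user scores from several detectors."""
    total = sum(weights)
    norm = [min_max(s) for s in score_lists]
    return [
        sum(w * col[i] for w, col in zip(weights, norm)) / total
        for i in range(len(norm[0]))
    ]


def rank_users(user_ids, fused):
    """Sort users from most to least anomalous by the fused score."""
    return sorted(zip(user_ids, fused), key=lambda t: t[1], reverse=True)


# Hypothetical per-user scores S1, S2, S3 and weights P1, P2, P3.
users = ["u1", "u2", "u3"]
s1 = [0.1, 0.9, 0.5]    # e.g. isolation forest
s2 = [1.0, 3.0, 2.0]    # e.g. One-Class SVM
s3 = [10.0, 50.0, 20.0]  # e.g. local outlier factor
fused = fuse([s1, s2, s3], weights=[0.5, 0.25, 0.25])
ranking = rank_users(users, fused)
```

Normalizing before weighting keeps a detector with a large numeric range, such as the third one here, from dominating the fused score.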
Further, in step 105, after performing data analysis on the classified data based on the anomaly detection algorithm to obtain normal data of the normal operation behavior of the user and abnormal data of the illegal operation behavior of the user, the method further includes:
if abnormal data of the illegal operation behavior of the user is determined, notifying a system administrator and related technicians by mail and short message, and starting a disaster recovery mechanism for part of the abnormal data to resolve the abnormality.
Specifically, based on the prediction result of step 603, if an abnormal user operation is predicted, the system administrator and the corresponding technicians are notified by email or short message. Meanwhile, to reduce similar illegal operation events that may occur later, the detailed indicators of the alarmed data are analyzed; for example, the operation may have occurred too many times, lasted too long, or had too many defects. Based on the analysis of each indicator, a disaster recovery mechanism is started for some abnormal conditions, for example by automatically starting certain containerized services on the standby nodes of the platform.
On the one hand, the invention can raise alarms for possible abnormal situations. On the other hand, by starting a disaster recovery mechanism for some abnormal scenarios, it attempts to resolve the abnormality, reduces the impact of the abnormality on the user experience, or buys more time for operation and maintenance personnel to locate and solve the problem.
The following describes the detection apparatus for user operation behavior data according to the present invention, and the detection apparatus for user operation behavior data described below and the detection method for user operation behavior data described above may be referred to in correspondence with each other.
Fig. 7 is a schematic structural diagram of the device for detecting user operation behavior data according to the present invention. As shown in Fig. 7, a detection apparatus 700 for user operation behavior data comprises a data acquisition module 710, an entity extraction module 720, a feature selection module 730, a cluster analysis module 740 and an anomaly detection module 750, wherein:
a data acquisition module 710, configured to acquire user operation behavior data, where the user operation behavior data is data describing various operation behaviors of a user;
an entity extraction module 720, configured to perform entity extraction on the user operation behavior data to obtain entity identification data, where the entity identification data is data related to abnormal data extracted from the user operation behavior data;
the feature selection module 730 is configured to perform feature selection and feature dimension reduction on the entity identification data to obtain feature data subjected to dimension reduction, where the feature data is data for implementing feature extraction and data compression through feature selection and feature dimension reduction;
the cluster analysis module 740 is configured to perform cluster analysis on the feature data to obtain classification data of various operation behaviors, where the classification data is used to classify various operation behaviors of a user;
and the anomaly detection module 750 is configured to perform data analysis on the classified data by using an anomaly detection algorithm to obtain normal data of a normal operation behavior of the user and abnormal data of an illegal operation behavior of the user.
Optionally, the data acquisition module 710 acquires user operation behavior data based on a first database, where the first database stores relational data and log data that record the various operation behaviors of users; the user operation behavior data comprises one or a combination of the following: the start/end time of each operation of the user, the specific steps of the operation, the operation sequence, and the final result of the operation.
Optionally, the entity extraction module 720 is further configured to perform the following steps:
marking partial data of the user operation behavior data to be used as training data, and training an entity extraction model by utilizing a neural network;
performing entity extraction on the user operation behavior data based on the entity extraction model to obtain entity identification data; wherein,
the first layer of the entity extraction model is a word embedding layer, which is used to convert an input word sequence into word vectors for output;
the second layer of the entity extraction model feeds the word vectors output by the first layer into a BiLSTM layer for training, so as to learn the relationship between words and output labels; the BiLSTM layer comprises a forward LSTM network and a backward LSTM network, which are connected through an output layer;
the third layer of the entity extraction model is an attention model placed on the output sequence of the BiLSTM layer, which is used to handle the labeling problem so that the entity extraction model can better focus on local features and highlight the importance of keywords;
and the fourth layer of the entity extraction model is a CRF layer applied after the attention mechanism, which outputs transition scores between labels through a transition matrix and obtains the optimal label sequence based on the transition rules of the labels and the validity of the label grammar.
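How a CRF layer turns per-word label scores and a transition matrix into an optimal label sequence can be illustrated with Viterbi decoding. This is a sketch of the decoding step only, not of the trained model: the label set (O, B, I), the emission scores (standing in for the attention-weighted BiLSTM outputs) and the transition scores are all hypothetical values.

```python
def viterbi(emissions, transitions):
    """Find the highest-scoring label path.

    emissions[t][y]   : score of label y at position t
    transitions[a][b] : score of moving from label a to label b
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    back = []  # back[t][b] = best previous label when position t+1 has label b
    for em in emissions[1:]:
        prev = score
        ptr, score = [], []
        for b in range(n_labels):
            best_a = max(range(n_labels), key=lambda a: prev[a] + transitions[a][b])
            ptr.append(best_a)
            score.append(prev[best_a] + transitions[best_a][b] + em[b])
        back.append(ptr)
    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda y: score[y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical labels: 0 = O (outside), 1 = B (entity start), 2 = I (entity inside).
NEG = -1e4  # large penalty: an I label may not directly follow an O label
transitions = [
    [0.5, 0.0, NEG],   # from O
    [0.0, -1.0, 1.0],  # from B
    [0.5, -1.0, 0.5],  # from I
]
emissions = [
    [1.0, 2.0, 0.0],  # word 1
    [0.5, 0.0, 2.0],  # word 2
    [2.0, 0.0, 0.5],  # word 3
]
best_path, best_score = viterbi(emissions, transitions)
```

The large negative transition score is one way to encode a label grammar rule, so that an invalid sequence such as O followed by I is never selected even when the per-word scores favor it.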
Optionally, the feature selection module 730 is further configured to perform the following steps:
summarizing the entity identification data and data stored in a second database, wherein the second database stores data for handling user services;
processing abnormal values/repeated values appearing in the data;
performing feature selection on the processed data, and storing the feature selection data subjected to selective filtering;
calculating a covariance matrix representing the data correlation based on the feature selection data, and performing eigendecomposition on the covariance matrix to obtain eigenvalues and an eigenvector set;
and projecting the eigenvalues and the eigenvector set onto a feature matrix to obtain the dimension-reduced feature data, and storing the feature data.
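The covariance and projection steps correspond to principal component analysis and can be sketched in plain Python. The sketch makes two assumptions: the projection is read in the usual PCA sense (the centered data is projected onto the leading eigenvectors), and power iteration is swapped in for a full eigendecomposition, so only the first principal component is computed. The function names and sample data are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    return cov, centered


def top_eigenpair(matrix, iters=500):
    """Power iteration: leading eigenvalue and eigenvector of a symmetric matrix.

    Assumes a nonzero matrix and a start vector not orthogonal to the answer.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    value = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d))
    return value, v


# Hypothetical two-feature data lying on the line y = x / 2.
rows = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0), (8.0, 4.0)]
cov, centered = covariance_matrix(rows)
eigenvalue, eigenvector = top_eigenpair(cov)
# Project each centered sample onto the leading eigenvector: 2-D -> 1-D.
reduced = [sum(c[j] * eigenvector[j] for j in range(2)) for c in centered]
```

Because the two features in this example are perfectly correlated, a single component carries all of the variance, so the reduction from two dimensions to one loses no information.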
Optionally, the cluster analysis module 740 is further configured to perform the following steps:
based on a K-means density clustering algorithm, dividing the set of the characteristic data into objects belonging to different cluster classes according to characteristic similarity, wherein the method comprises the steps of distributing the data with similar characteristics in the same cluster and distributing the data with dissimilar characteristics outside the cluster;
performing data analysis based on the density of the characteristic data distribution to obtain classified data of various operation behaviors;
the K-means density clustering algorithm is characterized in that a threshold is preset before clustering, a weight is calculated based on the density of the feature data, the intra-cluster average distance and the inter-cluster distance, the distance of the feature data is calculated by adopting a weighted Euclidean distance, and an initial clustering center is selected according to the density, the weight and the distance of the feature data obtained through calculation to obtain initial input parameters of the K-means density clustering algorithm.
Optionally, the anomaly detection module 750 is further configured to perform the following steps:
performing anomaly scoring on the classified data with each of three anomaly detection algorithms, namely the isolation forest, the One-Class SVM and the local outlier factor, to obtain the corresponding anomaly score values;
performing weighted normalization on the abnormal score values output by the three abnormal detection algorithms to obtain the ranking of the abnormal score values for all users;
and determining normal data of the normal operation behaviors of the user and abnormal data of the illegal operation behaviors of the user according to the ranking of the abnormal score values.
Further, the device 700 for detecting user operation behavior data further includes a system alarm module (not shown).
The alarm module is configured to, if abnormal data of the illegal operation behavior of the user is determined, notify a system administrator and related technicians by mail and short message, and start a disaster recovery mechanism for part of the abnormal data to resolve the abnormality.
Fig. 8 illustrates the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the method for detecting user operation behavior data, the method comprising:
collecting user operation behavior data, wherein the user operation behavior data is used for analyzing whether the operation behavior of a user is abnormal or not;
performing entity extraction on the user operation behavior data to obtain entity identification data, wherein the entity identification data is used for extracting data related to abnormal operation behaviors of users;
performing feature selection and feature dimension reduction on the entity identification data to obtain feature data subjected to dimension reduction, wherein the feature data are data for realizing feature extraction and data compression through feature selection and feature dimension reduction;
performing clustering analysis on the characteristic data to obtain classification data of various operation behaviors, wherein the classification data is used for classifying the various operation behaviors of the user;
and performing data analysis on the classified data by adopting an anomaly detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of abnormal operation behaviors of the user.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which when executed by a computer, enable the computer to perform the method for detecting the user operation behavior data provided by the above methods, the method comprising:
acquiring user operation behavior data, wherein the user operation behavior data is used for analyzing whether the operation behavior of a user is abnormal or not;
performing entity extraction on the user operation behavior data to obtain entity identification data, wherein the entity identification data is used for extracting data related to abnormal operation behaviors of users;
performing feature selection and feature dimension reduction on the entity identification data to obtain feature data subjected to dimension reduction, wherein the feature data are data for realizing feature extraction and data compression through feature selection and feature dimension reduction;
performing clustering analysis on the characteristic data to obtain classification data of various operation behaviors, wherein the classification data is used for classifying the various operation behaviors of the user;
and performing data analysis on the classified data by adopting an anomaly detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of abnormal operation behaviors of the user.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the method for detecting user operation behavior data provided in the above aspects, the method including:
collecting user operation behavior data, wherein the user operation behavior data is used for analyzing whether the operation behavior of a user is abnormal or not;
performing entity extraction on the user operation behavior data to obtain entity identification data, wherein the entity identification data is used for extracting data related to abnormal operation behaviors of the user;
performing feature selection and feature dimension reduction on the entity identification data to obtain feature data subjected to dimension reduction, wherein the feature data are data for realizing feature extraction and data compression through feature selection and feature dimension reduction;
performing clustering analysis on the characteristic data to obtain classification data of various operation behaviors, wherein the classification data is used for classifying the various operation behaviors of the user;
and performing data analysis on the classified data by adopting an anomaly detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of abnormal operation behaviors of the user.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for detecting user operation behavior data is characterized by comprising the following steps:
collecting user operation behavior data, wherein the user operation behavior data is used for analyzing whether the operation behavior of a user is abnormal or not;
performing entity extraction on the user operation behavior data to obtain entity identification data, wherein the entity identification data is used for extracting data related to abnormal operation behaviors of users;
performing feature selection and feature dimension reduction on the entity identification data to obtain feature data subjected to dimension reduction, wherein the feature data are data for realizing feature extraction and data compression through feature selection and feature dimension reduction;
performing clustering analysis on the characteristic data to obtain classification data of various operation behaviors, wherein the classification data is used for classifying the various operation behaviors of the user;
and performing data analysis on the classified data by adopting an anomaly detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of abnormal operation behaviors of the user.
2. The method for detecting user operation behavior data according to claim 1, wherein the collecting user operation behavior data comprises:
acquiring user operation behavior data based on a first database, wherein the first database stores relational data and log data for recording various operation behaviors of a user;
the user operation behavior data comprises one or a combination of the following: the start/end time of each operation of the user, the specific steps of the operation, the operation sequence, and the final result of the operation.
3. The method for detecting user operation behavior data according to claim 1, wherein the performing entity extraction on the user operation behavior data to obtain entity identification data includes:
marking partial data of the user operation behavior data to be used as training data, and training an entity extraction model by utilizing a neural network;
performing entity extraction on the user operation behavior data based on the entity extraction model to obtain entity identification data; wherein,
the first layer of the entity extraction model is a word embedding layer, which is used to convert an input word sequence into word vectors for output;
the second layer of the entity extraction model feeds the word vectors output by the first layer into a BiLSTM layer for training, so as to learn the relationship between words and output labels; the BiLSTM layer comprises a forward LSTM network and a backward LSTM network, which are connected through an output layer;
the third layer of the entity extraction model is an attention model placed on the output sequence of the BiLSTM layer, which is used to handle the labeling problem so that the entity extraction model can better focus on local features and highlight the importance of keywords;
and the fourth layer of the entity extraction model is a CRF layer applied after the attention mechanism, which outputs transition scores between labels through a transition matrix and obtains the optimal label sequence based on the transition rules of the labels and the validity of the label grammar.
4. The method for detecting user operation behavior data according to claim 1, wherein the performing feature selection and feature dimension reduction on the entity identification data to obtain feature data after dimension reduction comprises:
summarizing the entity identification data and data stored in a second database, wherein the second database stores data for handling user services;
processing abnormal values/repeated values appearing in the data;
performing feature selection on the processed data, and storing feature selection data subjected to selection filtering;
calculating a covariance matrix representing the data correlation based on the feature selection data, and performing eigendecomposition on the covariance matrix to obtain eigenvalues and an eigenvector set;
and projecting the eigenvalues and the eigenvector set onto a feature matrix to obtain the dimension-reduced feature data, and storing the feature data.
5. The method for detecting user operation behavior data according to claim 1, wherein the step of performing cluster analysis on the feature data to obtain classification information of various operation behaviors comprises:
based on a K-means density clustering algorithm, dividing the set of the characteristic data into objects belonging to different cluster classes according to characteristic similarity, wherein the method comprises the steps of distributing the data with similar characteristics in the same cluster and distributing the data with dissimilar characteristics outside the cluster;
performing data analysis based on the density of the characteristic data distribution to obtain classified data of various operation behaviors;
the K-means density clustering algorithm is characterized in that a threshold is preset before clustering, a weight is calculated based on the density of the feature data, the intra-cluster average distance and the inter-cluster distance, the distance of the feature data is calculated by adopting a weighted Euclidean distance, and an initial clustering center is selected according to the density, the weight and the distance of the feature data obtained through calculation to obtain initial input parameters of the K-means density clustering algorithm.
6. The method for detecting the user operation behavior data according to claim 1, wherein the step of performing data analysis on the classified data based on an anomaly detection algorithm to obtain normal data of a normal operation behavior of the user and abnormal data of an illegal operation behavior of the user comprises the steps of:
performing anomaly scoring on the classified data with each of three anomaly detection algorithms, namely the isolation forest, the One-Class SVM and the local outlier factor, to obtain the corresponding anomaly score values;
performing weighted normalization on the abnormal score values output by the three abnormal detection algorithms to obtain the ranking of the abnormal score values for all users;
and determining normal data of the normal operation behaviors of the user and abnormal data of the illegal operation behaviors of the user according to the ranking of the abnormal score values.
7. The method for detecting the user operation behavior data according to claim 1, wherein after the data analysis is performed on the classified data based on the anomaly detection algorithm to obtain normal data of a normal operation behavior of the user and abnormal data of an illegal operation behavior of the user, the method further comprises:
if the abnormal data is determined as the abnormal data of the illegal operation behaviors of the user, a system administrator and related technicians are informed in a mail and short message mode, and a disaster recovery mechanism is started for part of the abnormal data to solve the abnormal problem.
8. A detection device for user operation behavior data is characterized by comprising:
the data acquisition module is used for acquiring user operation behavior data, wherein the user operation behavior data is data describing various operation behaviors of a user;
the entity extraction module is used for performing entity extraction on the user operation behavior data to obtain entity identification data, wherein the entity identification data is data related to abnormal data extracted from the user operation behavior data;
the characteristic selection module is used for carrying out characteristic selection and characteristic dimension reduction on the entity identification data to obtain characteristic data subjected to dimension reduction, and the characteristic data are data for realizing characteristic extraction and data compression through characteristic selection and characteristic dimension reduction;
the cluster analysis module is used for carrying out cluster analysis on the characteristic data to obtain classification data of various operation behaviors, and the classification data is used for classifying the various operation behaviors of the user;
and the abnormality detection module is used for carrying out data analysis on the classified data by adopting an abnormality detection algorithm to obtain normal data of normal operation behaviors of the user and abnormal data of illegal operation behaviors of the user.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for detecting user operation behavior data according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for detecting user operation behavior data according to any one of claims 1 to 7.
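As an illustration only, the module chain of claim 8 (entity extraction, feature selection with dimension reduction, cluster analysis, anomaly detection) can be sketched in Python. The sketch does not reproduce the algorithms of the claimed method; it substitutes simple stand-ins that are assumptions of this example: min-max scaling with variance-based column selection for the feature step, plain k-means for the cluster analysis, and a cluster-size rule for the anomaly detection. The log records, field names, function names and thresholds are all hypothetical.

```python
from math import dist
from statistics import mean, pvariance

# Hypothetical operation-log records; every field name here is illustrative.
LOGS = [
    {"user": "u1", "hour": 9, "files_read": 5, "failed_logins": 0, "client": 1},
    {"user": "u2", "hour": 10, "files_read": 8, "failed_logins": 0, "client": 1},
    {"user": "u3", "hour": 11, "files_read": 6, "failed_logins": 1, "client": 1},
    {"user": "u4", "hour": 13, "files_read": 12, "failed_logins": 0, "client": 1},
    {"user": "u5", "hour": 14, "files_read": 7, "failed_logins": 0, "client": 1},
    {"user": "u6", "hour": 15, "files_read": 9, "failed_logins": 1, "client": 1},
    {"user": "u7", "hour": 16, "files_read": 10, "failed_logins": 0, "client": 1},
    {"user": "u8", "hour": 17, "files_read": 11, "failed_logins": 0, "client": 1},
    {"user": "u9", "hour": 3, "files_read": 400, "failed_logins": 9, "client": 1},
]
FIELDS = ["hour", "files_read", "failed_logins", "client"]


def extract_entities(records):
    """Entity-extraction stand-in: keep only fields assumed relevant to anomalies."""
    return [[float(r[f]) for f in FIELDS] for r in records]


def select_and_reduce(rows, keep=2):
    """Min-max scale, drop constant columns, keep the `keep` highest-variance columns."""
    scaled = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        if hi > lo:  # a constant column carries no information and is dropped
            scaled.append([(v - lo) / (hi - lo) for v in col])
    scaled.sort(key=pvariance, reverse=True)
    return [list(point) for point in zip(*scaled[:keep])]


def kmeans(points, k=2, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [mean(axis) for axis in zip(*members)]
    return labels


def detect_abnormal(labels, min_share=0.2):
    """Flag every point whose cluster holds less than `min_share` of all points."""
    sizes = {lab: labels.count(lab) for lab in set(labels)}
    return [sizes[lab] / len(labels) < min_share for lab in labels]


def run_pipeline(records):
    """Chain the four stand-in stages and return the users flagged as abnormal."""
    features = select_and_reduce(extract_entities(records))
    flags = detect_abnormal(kmeans(features))
    return [r["user"] for r, bad in zip(records, flags) if bad]


print(run_pipeline(LOGS))
```

With the sample records above, only the night-time record with many file reads and failed logins ends up in a small cluster of its own, so the script prints `['u9']`. A real deployment would likely replace each stand-in with the algorithm actually chosen for that stage and tune `keep`, `k` and `min_share` on real data.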
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110251231.XA CN115048464A (en) | 2021-03-08 | 2021-03-08 | User operation behavior data detection method and device and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110251231.XA CN115048464A (en) | 2021-03-08 | 2021-03-08 | User operation behavior data detection method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115048464A (en) | 2022-09-13 |
Family
ID=83156520
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110251231.XA Pending CN115048464A (en) | 2021-03-08 | 2021-03-08 | User operation behavior data detection method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115048464A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115454988A (en) * | 2022-09-27 | 2022-12-09 | 哈尔滨工业大学 | Satellite power supply system missing data completion method based on random forest network |
CN116245513A (en) * | 2023-05-11 | 2023-06-09 | 深圳市联合信息技术有限公司 | Automatic operation and maintenance system and method based on rule base |
CN116668192A (en) * | 2023-07-26 | 2023-08-29 | 国网山东省电力公司信息通信公司 | Network user behavior anomaly detection method and system |
CN116860977A (en) * | 2023-08-21 | 2023-10-10 | 之江实验室 | Abnormality detection system and method for contradiction dispute mediation |
CN118054971A (en) * | 2024-04-11 | 2024-05-17 | 南京中科齐信科技有限公司 | Isolation system based on intelligent analysis of industrial network communication behaviors |
CN118171129A (en) * | 2024-05-11 | 2024-06-11 | 中移(苏州)软件技术有限公司 | User data acquisition method, system, electronic device, chip and medium |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115454988A (en) * | 2022-09-27 | 2022-12-09 | 哈尔滨工业大学 | Satellite power supply system missing data completion method based on random forest network |
CN115454988B (en) * | 2022-09-27 | 2023-05-23 | 哈尔滨工业大学 | Satellite power supply system missing data complement method based on random forest network |
CN116245513A (en) * | 2023-05-11 | 2023-06-09 | 深圳市联合信息技术有限公司 | Automatic operation and maintenance system and method based on rule base |
CN116245513B (en) * | 2023-05-11 | 2023-07-07 | 深圳市联合信息技术有限公司 | Automatic operation and maintenance system and method based on rule base |
CN116668192A (en) * | 2023-07-26 | 2023-08-29 | 国网山东省电力公司信息通信公司 | Network user behavior anomaly detection method and system |
CN116668192B (en) * | 2023-07-26 | 2023-11-10 | 国网山东省电力公司信息通信公司 | Network user behavior anomaly detection method and system |
CN116860977A (en) * | 2023-08-21 | 2023-10-10 | 之江实验室 | Abnormality detection system and method for contradiction dispute mediation |
CN116860977B (en) * | 2023-08-21 | 2023-12-08 | 之江实验室 | Abnormality detection system and method for contradiction dispute mediation |
CN118054971A (en) * | 2024-04-11 | 2024-05-17 | 南京中科齐信科技有限公司 | Isolation system based on intelligent analysis of industrial network communication behaviors |
CN118054971B (en) * | 2024-04-11 | 2024-06-21 | 南京中科齐信科技有限公司 | Isolation system based on intelligent analysis of industrial network communication behaviors |
CN118171129A (en) * | 2024-05-11 | 2024-06-11 | 中移(苏州)软件技术有限公司 | User data acquisition method, system, electronic device, chip and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115048464A (en) | | User operation behavior data detection method and device and electronic equipment |
CN108959431B (en) | | Automatic label generation method, system, computer readable storage medium and equipment |
CN110059181B (en) | | Short text label method, system and device for large-scale classification system |
CN110825877A (en) | | Semantic similarity analysis method based on text clustering |
CN106951498A (en) | | Text clustering method |
CN111831790A (en) | | False news identification method based on low threshold integration and text content matching |
CN109471944A (en) | | Training method, device and the readable storage medium storing program for executing of textual classification model |
Karthikeyan et al. | | Probability based document clustering and image clustering using content-based image retrieval |
CN111428028A (en) | | Information classification method based on deep learning and related equipment |
Lin et al. | | Effective feature space reduction with imbalanced data for semantic concept detection |
Assery et al. | | Comparing learning-based methods for identifying disaster-related tweets |
CN110377690A (en) | | A kind of information acquisition method and system based on long-range Relation extraction |
CN114328800A (en) | | Text processing method and device, electronic equipment and computer readable storage medium |
Alam et al. | | Social media content categorization using supervised based machine learning methods and natural language processing in bangla language |
CN111144453A (en) | | Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data |
Yafooz et al. | | Enhancing multi-class web video categorization model using machine and deep learning approaches |
Khan et al. | | Bengali crime news classification based on newspaper headlines using NLP |
Pandey et al. | | A hierarchical clustering approach for image datasets |
CN117009596A (en) | | Identification method and device for power grid sensitive data |
CN111767404A (en) | | Event mining method and device |
CN116738068A (en) | | Trending topic mining method, device, storage medium and equipment |
Akhgari et al. | | Sem-TED: semantic twitter event detection and adapting with news stories |
Sami et al. | | Incorporating random forest trees with particle swarm optimization for automatic image annotation |
CN115409433B (en) | | Depth NLP-based method and device for analyzing important community personnel portrait |
Watcharapinchai et al. | | Dimensionality reduction of SIFT using PCA for object categorization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||