CN113158076A - Social robot detection method based on variational self-coding and K-nearest neighbor combination - Google Patents
- Publication number
- CN113158076A (application CN202110364341.7A)
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- data
- network
- coding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/9536—Search customisation based on social or collaborative filtering
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3344—Query execution using natural language analysis
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/24155—Bayesian classification
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06Q50/01—Social networking
Abstract
A social robot detection method based on the combination of variational self-coding and K-nearest neighbors belongs to the technical field of anomaly detection. The method obtains public social robot data from the network, extracts features through preprocessing, and trains on the data; variational self-coding is used to encode and decode the features, so that after decoding the features of normal samples remain close to the initial features while abnormal samples differ greatly from them. The original features and the decoded features are fused, and anomaly detection is performed with a K-nearest-neighbor anomaly detection method. The method takes into account that, in a large social network environment, abnormal user groups are far smaller than normal user groups, so collecting abnormal users during data collection is relatively troublesome. By reducing the participation of abnormal samples in model training, the method overcomes the defects of costly labeling and imbalanced positive and negative samples in existing social network robot detection methods, and achieves efficient detection of social network robot users.
Description
Technical Field
The invention belongs to the technical field of anomaly detection, and in particular relates to social robot detection based on variational self-encoding.
Background
With the great popularization and development of the internet, a large amount of real online user behavior data is available for studying human behavior. As of December 2020, the number of netizens in China reached 989 million and Twitter's daily active users reached 192 million; as of September 2020, Weibo's monthly active users reached 511 million, with an average of 224 million daily active users. This huge user base generates TB-scale data every day, recording the rich online behavior of hundreds of millions of ordinary users. Social media has become an indispensable part of people's lives for acquiring and sharing information. In summary, social media websites such as Twitter and Weibo bring new opportunities: from objective behavior data we can study whether a user's behavior deviates from normal social patterns, and thereby detect users who endanger network security.
Most people now like and are willing to express emotions, record their lives and speak out on mass social media platforms; the whole social network has gradually become more complicated and diversified, and the accompanying problems emerge endlessly. Social robots (automated programs that simulate the behavior of real, normal users in a social network) have been created for various purposes. The social robot was created to serve human beings and improve quality of life, but its development has escaped human control: a social robot can disguise itself as an independent entity, create false accounts, and carry out activities such as stealing user privacy, sending spam, spreading malicious links and launching DDoS attacks, injuring innocent users. Social robots have become a malignant tumor in social networks, harming the health of the social network. According to a report of the U.S. Securities and Exchange Commission, over 23 million active accounts on Twitter in 2014 were actually social robots, which have become an important force of content production and transmission in social media. The 2020 Bad Bot Report on the state of automated network traffic, issued by a network security service provider, indicates that in 2019 malicious bot traffic accounted for 24.1% of the total, benign bot traffic for 13.1%, and human traffic, which grew by 1.1% over the previous year, for 62.8%, as shown in fig. 1. The robots mentioned in these reports often appear as botnets that hide the sources of their traffic through anonymous proxies and other identity-hiding techniques while disguising themselves as legitimate humans. It is this property that makes them difficult to control, and it gives the problem of detecting robots strong significance.
For example, detecting social robots that influence political elections by distorting online opinion, manipulate stock markets, or push anti-vaccine conspiracy theories that lead to health epidemics is a challenging and meaningful task.
A social robot is a program that mimics human social behavior. Early detection of bad actors in social networks focused mainly on paid posters (the "internet water army"), spam users, and fake "zombie" followers. With the appearance of machine users, all circles have become aware of the negative effects of malicious social robots; because machine users appeared relatively late, research on them is comparatively scarce and started late. Researchers have classified social network users into human users, normal machine users and malicious machine users. Normal machine users rarely engage in malicious behavior, and their behavioral features resemble those of normal users while differing clearly from those of malicious robots; therefore normal robots can be defined as normal users and malicious machine users as machine users, and the detection of malicious social robots can be regarded as a classification problem: if a user is a malicious machine user, it is treated as a positive example in the training set; otherwise it is a normal user and treated as a negative example. Most research treats the detection of machine users as a classification problem, for example using Random Forest (RF), AdaBoost, logistic regression (LR) and Decision Tree (DT) models as classifiers for prediction. However, classification-based methods must be trained in advance, depend heavily on the accuracy of the training data and its labels, and lack an effective scheme for the class-imbalance problem. Current anomaly detection research results are remarkable, and anomaly detection is better suited to detecting abnormal users in social networks.
Disclosure of Invention
In a large social network environment, the number of abnormal users is small relative to the number of normal users, so collecting abnormal users during data collection is relatively troublesome. To reduce the participation of abnormal samples in model training, the invention provides a social robot detection method based on the combination of variational self-encoding (Variational Autoencoder, VAE) and anomaly detection: the model is trained on the data, then variational self-coding is used to encode and decode it; after decoding, the features of normal samples remain close to the initial features while abnormal samples differ greatly from them. The original features and the decoded features are fused, and an anomaly detection method is then used for anomaly detection.
Drawings
FIG. 1 machine traffic proportions;
FIG. 2 flow chart of the invention;
FIG. 3 variational self-encoding structure;
FIG. 4 visualization of the variational self-encoding codec;
Detailed Description
As shown in fig. 2, the invention provides a social robot detection method based on the combination of variational self-coding and anomaly detection. The inventive method comprises the following steps: step 1, data acquisition and preprocessing, in which the original text data acquired from the network is processed by a program to obtain an original feature matrix; step 2, generation of features through variational self-coding, a deep generative model; step 3, feature fusion of the original features and the generated features, followed by detection of social robots with an anomaly detection method.
Step 1, data acquisition and preprocessing: the original text data acquired from the network is processed by a program to obtain the original feature matrix
Publicly available social robot data is scarce. The invention selects the labeled public CLEF2019 data set, with 2880 accounts in the training set, 1240 in the verification set, and 100 tweets per account; every account is labeled as robot or normal user (including a gender label), giving 2060 normal users and 2060 machine users in total. The social robot data used by the invention is represented as X = {x_i}, i = 1, ..., N, where N = 4120 denotes the number of samples and i denotes a sample.
After the original data set is obtained, text cleaning is carried out with NLTK, a powerful natural language processing library for Python, and the text similarity between the texts posted by each user is computed with Gensim, an open-source third-party Python toolkit for unsupervised learning of latent topic vector representations from raw unstructured text. Gensim supports several topic model algorithms, including TF-IDF, LSA, LDA and word2vec, supports streaming training, and provides APIs for common tasks such as similarity calculation and information retrieval. After program processing, the invention extracts 16-dimensional features in total, as follows:
the proportion of tweets mentioning others with @;
the average number of emoticons (expressions) used per tweet;
the average number of stop words contained per tweet;
the average number of uses of 8 symbols ("#", ",", "-", ";", "!", "(", ")", among others), eight dimensions in total;
the number of URLs contained per tweet on average;
the average length of original tweets;
the average length of retweets;
the proportion of retweets among all tweets;
the average similarity between tweets.
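The "average similarity between tweets" feature above is computed with Gensim in the text. The following is a minimal stand-in sketch, not the actual Gensim pipeline: it uses a plain bag-of-words cosine similarity with whitespace tokenization and no stop-word removal, which is enough to show why a bot posting near-duplicate tweets scores high on this feature.

```python
from collections import Counter
import itertools
import math

def avg_pairwise_similarity(tweets):
    """Average pairwise cosine similarity over one user's tweets (bag-of-words)."""
    vecs = [Counter(t.lower().split()) for t in tweets]

    def cos(a, b):
        # dot product over the words of a; missing keys in b count as 0
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    pairs = list(itertools.combinations(vecs, 2))
    return sum(cos(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# a bot repeating the same message gets similarity close to 1
print(avg_pairwise_similarity(["buy now cheap", "buy now cheap", "great deal buy now"]))
```

Real pipelines would tokenize with NLTK and build a TF-IDF or topic-model similarity index with Gensim, as the text describes.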
After the 16-dimensional features are obtained, each feature dimension of the samples is normalized. The normalization formula is:

x*_{i,l} = (x_{i,l} − l_min) / (l_max − l_min)

In the above formula, l denotes the feature dimension of sample i in the feature matrix, with l ranging from 0 to 15; l_max is the maximum value of dimension l over the samples and l_min the minimum value, and the feature data set after normalization is thus obtained.
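The min-max normalization above can be sketched as follows, column-wise over the 16 feature dimensions; the guard against a zero-range column is an added assumption, not stated in the text:

```python
import numpy as np

def minmax_normalize(X):
    """Min-max normalize each feature column of X to [0, 1].

    X: (N, d) raw feature matrix. Columns whose max equals their min
    (zero range) are mapped to 0 to avoid division by zero.
    """
    l_min = X.min(axis=0)
    l_max = X.max(axis=0)
    rng = np.where(l_max - l_min == 0, 1.0, l_max - l_min)
    return (X - l_min) / rng

X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 30.0]])
print(minmax_normalize(X))  # each column now spans [0, 1]
```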
Step 2, generating features through a deep generative model
A variational autoencoder, a form of deep generative model, is a generative network structure based on variational Bayes (VB) inference. Its structure is shown in fig. 3. Variational self-coding builds two probability density distribution models with two neural networks: one generates the variational probability distribution of the hidden variable, used for variational inference of the original input data, and is called the inference network; the other restores an approximate probability distribution that generates the original data from the generated variational distribution of the hidden variable, and is called the generation network.
The original sample set obtained in step 1 is X = {x_i}. Each data sample x_i is an independently generated continuous or discrete random variable. The observable variable X, a random vector in a high-dimensional space, serves as the input visible-layer variable; from it the hidden-layer variable Z is generated, where the non-observable variable Z is a random vector in a space of relatively low dimension, producing the generated data set X*. X* denotes the sample set obtained by encoding and decoding the original sample set through variational self-coding. The variational self-coding generative model can be divided into two processes:
(1) Approximate inference process for the posterior distribution of hidden variable Z: the recognition model q_φ(z|x), i.e. the inference network; q_φ(z|x) represents the process of inferring z from known x.
(2) Generation process for variable X* by conditional distribution: the conditional distribution p_θ(z)p_θ(x*|z), i.e. the generation network.
The core of variational self-coding is to make q_φ(z|x) approximately equal to the true posterior distribution p_θ(z|x). The problem is thus transformed into the optimization goal, for both the inference network and the generation network, of maximizing a variational lower bound function; log in the following formula is the base-10 logarithm, and θ (generation network parameters) and φ (inference network parameters) are the parameters of the network, initialized before training and then updated by training. L(θ, φ; X) is the variational lower bound function, and the parameters θ and φ are solved from the known sample set X:

(θ, φ) = argmax_{θ,φ} L(θ, φ; X)

z_i = μ_i + ε_i · δ_i

In the above equations, argmax denotes maximizing the variational lower bound function L; x*_i denotes the data generated for sample i; z_i denotes the hidden variable corresponding to sample i; μ_i denotes the mean of sample i in the inference network and δ_i the variance of sample i in the inference network. To complete the sampling of Z, an auxiliary parameter ε is introduced, obtained by random sampling from the standard normal distribution N(0, 1); ε_i denotes the randomly sampled value used to map sample i to the hidden-layer variable z_i. With the auxiliary parameter introduced, the relation between the hidden variable Z and the mean and variance changes from a sampling computation to a numerical computation, so the optimization can directly use stochastic gradient descent. The conditional distribution p_θ(x*|z), obeying a Bernoulli or Gaussian distribution, represents the process of inferring x* from known z for sample i in the generation network, and is computed directly from its probability density function formula. Each term of the variational lower bound can then be computed directly; the parameters θ and φ of all visible and hidden units are naturally updated through training, the model structure is finally determined by θ and φ, and corresponding data can be generated from input data. The variational self-encoding codec can be visualized as in fig. 4, with 16 encoder input features, a 4-dimensional hidden variable, 16 decoder output features, and a training batch of 2000.
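Two pieces of the step above can be sketched numerically: the reparameterization z_i = μ_i + ε_i · δ_i, and the Gaussian KL term that regularizes the variational lower bound. This is a minimal numpy sketch, not the trained network; the standard ELBO formulation shown here uses the natural logarithm, and the shapes follow the text's 4-dimensional hidden variable.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + eps * sigma, eps ~ N(0, 1),
    matching the text's z_i = mu_i + eps_i * delta_i."""
    eps = rng.standard_normal(mu.shape)
    return mu + eps * np.exp(0.5 * log_var)

def gaussian_kl(mu, log_var):
    """Per-sample KL(q(z|x) || N(0, I)), the regularizer inside the
    variational lower bound L(theta, phi; X) (natural-log form)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

# 4 samples with a 4-dimensional hidden variable, as in the text
mu = np.zeros((4, 4))
log_var = np.zeros((4, 4))       # log variance 0 => unit variance
z = reparameterize(mu, log_var)
print(z.shape)                   # (4, 4)
print(gaussian_kl(mu, log_var))  # zeros: q(z|x) already matches the prior
```

Because ε carries all the randomness, gradients flow through μ and δ, which is exactly what lets the lower bound be optimized by stochastic gradient descent.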
Step 3, after feature fusion of the original features and the generated features, detecting the social robot with an anomaly detection method
The original sample features are encoded and decoded through the variational self-coding of step 2; after decoding, normal sample features remain close to the initial features, while abnormal samples differ greatly from them. The original feature matrix X obtained in step 1 and the decoded new feature matrix X* are fused, and the fused matrix serves as the input to the anomaly detection part.
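The fusion step can be sketched as follows, under the assumption (not spelled out in the text) that fusion means column-wise concatenation of the original features with their reconstruction:

```python
import numpy as np

def fuse_features(X, X_star):
    """Concatenate original features with their VAE reconstruction.

    Normal samples reconstruct closely, so their two halves agree;
    anomalous samples diverge, which the downstream KNN detector exploits.
    """
    return np.concatenate([X, X_star], axis=1)

X = np.ones((3, 16))
X_star = np.full((3, 16), 0.9)   # hypothetical decoded features
print(fuse_features(X, X_star).shape)  # (3, 32)
```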
The anomaly detection method selected by the invention is the k-nearest-neighbor (KNN) algorithm: given a training data set, for a new input instance, find the k instances nearest to it in the training data set; if the majority of these k instances belong to a certain class, assign the input instance to that class. The specific steps are as follows:
1. Obtain a labeled sample data set (training sample set) containing the correspondence between each piece of data and its classification.
2. After inputting new, unlabeled data, compare each feature of the new data with the corresponding features of the data in the sample set:
1) compute the distance between the new data and each piece of data in the sample data set;
2) sort all the distances found from small to large (smaller means more similar);
3) take the classification labels corresponding to the first k sample data (k is generally less than or equal to 20).
3. Take the classification label that occurs most frequently among the k data as the classification of the new data.
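The steps above can be sketched as a short numpy implementation; Euclidean distance and majority vote as in the text, with k = 3 here only because the toy data set is tiny:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training samples
    (Euclidean distance). k = 5 gave the best AUC in the text's experiments."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # step 1) distances
    nearest = np.argsort(dists)[:k]                        # steps 2)-3) sort, take k
    votes = Counter(y_train[i] for i in nearest)           # step 3. majority vote
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05]), k=3))  # 0
```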
The distance measure of the invention is the common Euclidean distance. The distance between two points in the multidimensional space is computed as:

d(y_i, y_j) = sqrt( Σ_n ( y_i^(n) − y_j^(n) )² )

where d(y_i, y_j) denotes the Euclidean distance between sample i and sample j, y_i^(n) the n-th dimension feature value of sample i, y_j^(n) the n-th dimension feature value of sample j, 1 ≤ i ≤ N, 1 ≤ j ≤ N.
In addition, the classification decision usually picks the majority-vote label (the minority obeying the majority) in classification problems, and usually takes the average of the labels of the K nearest neighbors in regression problems; the invention also selects the average. In the experiments of the invention, the effect is best when K = 5. Training time and AUC are used as evaluation indices, where AUC is the area under the ROC (receiver operating characteristic) curve. AUC equals the proportion, among all positive-negative sample pairs, of pairs in which the positive sample is ranked ahead of the negative sample, i.e. a probability value. The AUC calculation formula is as follows:

AUC = ( Σ_{i∈positive samples} rank_i − M(M+1)/2 ) / (M × P)

In the above formula, M denotes the number of positive samples and P the number of negative samples. The AUC statistic sorts the probability values from large to small, assigns rank N to the sample with the maximum probability value (N being the number of samples), rank N−1 to the second largest, and so on; the ranks of all positive samples are added, and the minimum possible rank sum of the positive samples, M(M+1)/2, is subtracted. The result counts, over all sample pairs, how often the positive-class score is greater than the negative-class score, and is finally divided by M × P. The experimental results of the invention are as follows:
Method        | AUC    | time
VAE-KNN       | 0.9287 | 0.1157
VAE-Mean_KNN  | 0.941  | 0.1077
VAE-Media_knn | 0.9351 | 0.1396
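The rank-based AUC computation described above can be sketched as follows (ties are ignored for simplicity; this is an illustration, not the text's evaluation code):

```python
import numpy as np

def auc_by_rank(scores, labels):
    """AUC via the rank formula: the fraction of positive-negative pairs
    in which the positive sample scores higher.

    Equivalent to (sum of positive ranks - M(M+1)/2) / (M * P), with ranks
    assigned in ascending score order; M positives, P negatives.
    """
    order = np.argsort(scores)                # ascending scores
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    M, P = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - M * (M + 1) / 2) / (M * P)

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, 0, 0])
print(auc_by_rank(scores, labels))  # 1.0: every positive outranks every negative
```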
Claims (2)
1. A social robot detection method based on the combination of variational self-coding and K-nearest neighbors, characterized by comprising the following steps:
step 1, data acquisition and preprocessing, wherein acquired original data are processed by a program to obtain an original characteristic matrix;
step 2, generating characteristics through a depth generation model;
step 3, after the original features and the generated features are subjected to feature fusion, detecting the social robot by using an anomaly detection method;
the method specifically comprises the following steps: firstly, an original social robot data set is obtained and processed to obtain a characteristic matrix expressed asi is a sample; the features extracted are as follows:
mention of @ proportion of others
Number of average used expressions in tweets;
the number of stop words contained in the tweet on average;
the average number of uses of 8 symbols ("#", ",", "-", ";", "!", "(", ")", among others), eight dimensions in total;
the number of URLs contained in the sent message on average;
the average length of original tweets;
the average length of retweets;
the proportion of retweets among all tweets;
the average similarity between tweets;
after the features are obtained, each feature dimension of the samples is normalized; the normalization formula is:

x*_{i,l} = (x_{i,l} − l_min) / (l_max − l_min)

in the above formula, l denotes the feature dimension of sample i in the feature matrix, l_max is the maximum value of dimension l over the sample data and l_min the minimum value, and the feature data set after normalization is thus obtained;
Using a variational autoencoder (VAE) as the deep generative model, with sample set X = {x_i}: each data sample x_i is an independently generated continuous or discrete random variable; the observable variable X serves as the input visible-layer variable, from which a hidden-layer variable Z is generated, producing the generated data set X*; X* denotes the sample set obtained after the original sample set is encoded and decoded through variational self-coding; the variational self-coding generative model is divided into two processes:
(1) approximate inference process for the posterior distribution of hidden variable Z: the recognition model q_φ(z|x), i.e. the inference network; q_φ(z|x) represents the process of inferring z from known x;
(2) generation process for variable X* by conditional distribution: the conditional distribution p_θ(z)p_θ(x*|z), i.e. the generation network;
the optimization target of both the inference network and the generation network is to maximize the variational lower bound function; log in the formula denotes the base-10 logarithm, and θ (generation network parameters) and φ (inference network parameters) are the parameters of the network, initialized before training and then updated by training; L(θ, φ; X) is the variational lower bound function, and the parameters θ and φ are solved from the known sample set X:

(θ, φ) = argmax_{θ,φ} L(θ, φ; X)

z_i = μ_i + ε_i · δ_i

argmax denotes maximizing the variational lower bound function L; x*_i denotes the data generated for sample i; z_i denotes the hidden variable corresponding to sample i; μ_i denotes the mean of sample i in the inference network and δ_i the variance of sample i in the inference network; an auxiliary parameter ε is introduced, obtained by random sampling from the standard normal distribution N(0, 1), and ε_i denotes the randomly sampled value used to map sample i to the hidden-layer variable z_i; with the auxiliary parameter introduced, the relation between the hidden variable Z and the mean and variance changes from a sampling computation to a numerical computation, and the optimization directly uses stochastic gradient descent; the conditional distribution p_θ(x*|z), obeying a Bernoulli or Gaussian distribution, represents the process of inferring x* from known z for sample i in the generation network, and is computed directly from its probability density function formula; each term of the variational lower bound is then computed directly, the parameters θ and φ of all visible and hidden units are naturally updated through training, and the model structure is finally determined.
2. The social robot detection method based on the combination of variational self-coding and K nearest neighbors according to claim 1, wherein the variational self-coding encodes and decodes the original sample features, and the original feature matrix X and the decoded new feature matrix X* are fused to obtain a fused matrix; the non-parametric, lazy k-nearest-neighbor algorithm is very suitable for anomaly detection: given a training data set and a new input instance, the algorithm finds the k instances in the training data set that are nearest to the instance, and if the majority of these k instances belong to a certain class, the input instance is assigned to that class; the value of k is greater than 0, with an upper limit of at most 20; the distance metric between two instances is the Euclidean distance, where the distance between two points in the multidimensional space is calculated as follows:
in the formula, d (y)i,yj) Representing the euclidean distance between sample i and sample j,an nth-dimension feature value representing the sample i,representing the nth dimension characteristic value of the sample j, i is more than or equal to 1 and less than or equal to N, and j is more than or equal to 1 and less than or equal to N;
in addition, in the classification problem the output is the majority class of the K nearest neighbors, while in the regression problem it is the average of the labels of the K nearest instances found by the distance calculation; AUC is taken as the evaluation index, where AUC is the area under the ROC (receiver operating characteristic) curve and equals the proportion, among all positive-negative sample pairs, of pairs in which the positive sample is ranked ahead of the negative sample, i.e. a probability value; the AUC calculation formula is as follows:

AUC = ( Σ_{i ∈ positive samples} rank_i − M(M+1)/2 ) / (M × P)

in the above formula, M denotes the number of positive samples and P denotes the number of negative samples; the specific way of computing the AUC statistic is to sort the probability values from large to small, let the rank of the sample with the largest probability value be N (N being the number of samples), the rank of the sample with the second largest value be N−1, and so on; then the ranks of all positive samples are added up and M(M+1)/2, the smallest possible sum of the M positive-sample ranks, is subtracted; what is obtained is the number of pairs, among all samples, in which the positive-class score is greater than the negative-class score; finally this is divided by M × P.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110364341.7A CN113158076B (en) | 2021-04-05 | 2021-04-05 | Social robot detection method based on variational self-coding and K-nearest neighbor combination |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113158076A true CN113158076A (en) | 2021-07-23 |
CN113158076B CN113158076B (en) | 2022-07-22 |
Family
ID=76888430
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110364341.7A Expired - Fee Related CN113158076B (en) | 2021-04-05 | 2021-04-05 | Social robot detection method based on variational self-coding and K-nearest neighbor combination |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113158076B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682118A (en) * | 2016-12-08 | 2017-05-17 | 华中科技大学 | Social network site false fan detection method achieved on basis of network crawler by means of machine learning |
CN111556016A (en) * | 2020-03-25 | 2020-08-18 | 中国科学院信息工程研究所 | Network flow abnormal behavior identification method based on automatic encoder |
EP3719711A2 (en) * | 2020-07-30 | 2020-10-07 | Institutul Roman De Stiinta Si Tehnologie | Method of detecting anomalous data, machine computing unit, computer program |
CN111767472A (en) * | 2020-07-08 | 2020-10-13 | 吉林大学 | Method and system for detecting abnormal account of social network |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114330370A (en) * | 2022-03-17 | 2022-04-12 | 天津思睿信息技术有限公司 | Natural language processing system and method based on artificial intelligence |
CN114330370B (en) * | 2022-03-17 | 2022-05-20 | 天津思睿信息技术有限公司 | Natural language processing system and method based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN113158076B (en) | 2022-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Alrubaian et al. | A credibility analysis system for assessing information on twitter | |
Hu et al. | Social spammer detection with sentiment information | |
Jacob | Multi-objective genetic algorithm and CNN-based deep learning architectural scheme for effective spam detection | |
Ramalingaiah et al. | Twitter bot detection using supervised machine learning | |
Lin et al. | A graph convolutional encoder and decoder model for rumor detection | |
Qiu et al. | An adaptive social spammer detection model with semi-supervised broad learning | |
Feng et al. | A phishing webpage detection method based on stacked autoencoder and correlation coefficients | |
Tehlan et al. | A spam detection mechamism in social media using soft computing | |
Abinaya et al. | Spam detection on social media platforms | |
CN113158076B (en) | Social robot detection method based on variational self-coding and K-nearest neighbor combination | |
Tian et al. | Predicting rumor retweeting behavior of social media users in public emergencies | |
CN114218457B (en) | False news detection method based on forwarding social media user characterization | |
CN106557983B (en) | Microblog junk user detection method based on fuzzy multi-class SVM | |
Yang et al. | A model for early rumor detection base on topic-derived domain compensation and multi-user association | |
Sattikar et al. | A role of artificial intelligence techniques in security and privacy issues of social networking | |
Eckhardt et al. | Convolutional Neural Networks and Long Short Term Memory for Phishing Email Classification | |
İş et al. | A Profile Analysis of User Interaction in Social Media Using Deep Learning. | |
Arora et al. | Significant machine learning and statistical concepts and their applications in social computing | |
CN113157993A (en) | Network water army behavior early warning model based on time sequence graph polarization analysis | |
Mudda et al. | Spatial-aware deep recommender system | |
Pranathi et al. | Logistic regression based cyber harassment identification | |
Yang et al. | Improving blog spam filters via machine learning | |
Guo | Comparison of neural network and traditional classifiers for twitter sentiment analysis | |
Nagavi et al. | Detection and Classification of Toxic Content for Social Media Platforms | |
Chowdhury | Spam identification on Facebook, Twitter and Email using machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220722 |