CN113158076A

CN113158076A - Social robot detection method based on variational self-coding and K-nearest neighbor combination

Info

Publication number: CN113158076A
Application number: CN202110364341.7A
Authority: CN
Inventors: 王秀娟; 郑倩倩; 郑康锋; 随艺; 曹思玮; 石雨桐
Original assignee: Beijing University of Technology; Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Technology; Beijing University of Posts and Telecommunications
Priority date: 2021-04-05
Filing date: 2021-04-05
Publication date: 2021-07-23
Anticipated expiration: 2041-04-05
Also published as: CN113158076B

Abstract

A social robot detection method based on the combination of variational self-coding and K-nearest neighbors, belonging to the technical field of anomaly detection. The method obtains public social robot data from the network, extracts features through preprocessing, and trains a variational self-coding model on the data. The model encodes and decodes each sample: after decoding, the features of normal samples stay close to their initial features, whereas the features of abnormal samples differ substantially from theirs. The original features and the decoded features are then fused, and anomaly detection is performed with a K-nearest-neighbor anomaly detection method. The method takes into account that, in the social network environment at large, abnormal users are far fewer than normal users, so abnormal users are comparatively difficult to collect. The method overcomes the high labeling cost and the imbalance between positive and negative samples in existing social network robot detection methods, and achieves efficient detection of social network robot users by reducing the participation of abnormal samples in model training.

Description

Social robot detection method based on variational self-coding and K-nearest neighbor combination

Technical Field

The invention belongs to the technical field of anomaly detection, and particularly relates to social robot detection based on variational self-encoding.

Background

With the popularization and development of the internet, a large amount of real online user behavior data has become available for studying human behavior. As of December 2020, the number of internet users in China reached 989 million and Twitter had 192 million daily active users; as of September 2020, Weibo had 511 million monthly active users and an average of 224 million daily active users. Such a huge user base generates terabytes of data every day, and these data record the rich online behavior of a vast number of ordinary users. Social media has become an indispensable channel through which people acquire and share information. In short, social media sites such as Twitter and Weibo bring new opportunities: objective behavior data can be used to study whether a user's behavior deviates from normal social patterns, and thereby to detect users who undermine network security.

Most people now like to express emotions, record their lives, and voice opinions on mass social media platforms. The social network as a whole is becoming increasingly complex and diverse, and the accompanying problems emerge endlessly. Social robots (automated programs that simulate the behavior of real, normal users in a social network) have been created for various purposes. They were originally created to serve people and improve quality of life, but their development has escaped human control: a social robot can disguise itself as an independent entity, create fake accounts, and steal user privacy, send spam, spread malicious links, or launch DDoS attacks. These activities harm innocent users, and social robots have become a malignant tumor that damages the health of the social network. According to a report of the U.S. Securities and Exchange Commission, more than 23 million active accounts on Twitter in 2014 were actually social robots, which have become an important force in producing and spreading content on social media. The 2020 Bad Bot Report, issued by a network security service provider on the current state of automated network traffic, indicates that in 2019 malicious machine traffic accounted for 24.1% of total traffic, good machine traffic accounted for 13.1%, and human traffic, which rose by 1.1% over the previous year, accounted for 62.8%, as shown in fig. 1. The robots mentioned in the report often appear as botnets that hide the origin of their traffic through anonymous proxies and other identity-hiding techniques while disguising themselves as legitimate humans. It is this property that makes them difficult to control, and it makes robot detection a highly significant problem. For example, detecting social robots that influence political elections by distorting online opinion, manipulate stock markets, or push anti-vaccine conspiracy theories that lead to health epidemics is a challenging and meaningful task.

A social robot is a program that mimics human social behavior. Early detection of bad users in social networks focused mainly on paid posters (the "water army"), spam users, and zombie followers. With the emergence of machine users, all sectors have become aware of the negative effects of malicious social robots; because machine users appeared late, research on them started late and is comparatively scarce. Researchers have classified social network users into human users, normal machine users, and malicious machine users. Normal machine users are unlikely to engage in malicious behavior; their behavior characteristics are closer to those of normal users and clearly different from those of malicious robots. Normal robots can therefore be treated as normal users and malicious machine users as machine users, and the detection of malicious social robots can be regarded as a classification problem: a malicious machine user is a positive example in the training set, and any other user is a normal user and a negative example. Most studies treat machine user detection as a classification problem and use, for example, Random Forest (RF), AdaBoost, linear Regression (LR), and Decision Tree (DT) models as classifiers. However, classification-based methods must be trained in advance, depend heavily on the accuracy of the training data and its labels, and lack an effective remedy for class imbalance. Anomaly detection research has produced remarkable results and is better suited to detecting abnormal users in social networks.

Disclosure of Invention

In the social network environment at large, abnormal users are few relative to normal users, so abnormal users are comparatively difficult to collect. To reduce the participation of abnormal samples in model training, the invention provides a social robot detection method based on the combination of variational self-encoding (VAE) and anomaly detection. The method trains a variational self-encoding model on the data and then uses it to encode and decode each sample. After decoding, the features of normal samples are close to their initial features, whereas the features of abnormal samples differ substantially from theirs. The original features and the decoded features are fused, and an anomaly detection method is then applied.

Drawings

FIG. 1 shows the proportions of machine traffic;

FIG. 2 is a flow chart of the present invention;

FIG. 3 shows the variational self-coding structure;

FIG. 4 is a visualization of the variational self-coding encoder and decoder;

Detailed Description

As shown in fig. 2, the invention provides a social robot detection method based on the combination of variational self-coding and anomaly detection. The method comprises the following steps: step 1, data acquisition and preprocessing, in which original text data acquired from the network are processed by a program to obtain an original feature matrix; step 2, feature generation through variational self-coding, a deep generative model; and step 3, feature fusion of the original features and the generated features, followed by detection of social robots with an anomaly detection method.

Step 1, data acquisition and preprocessing, in which original text data acquired from the network are processed by a program to obtain the original feature matrix

Publicly available social robot data are scarce. The invention uses the public, labeled CLEF2019 data set, which contains 2880 accounts in the training set and 1240 accounts in the validation set, with 100 tweets per account. Every account is labeled as either a robot or a normal user (normal users also carry a gender label); in total there are 2060 normal users and 2060 machine users. The social robot data used by the invention are represented as a set of N samples, where N = 4020 denotes the number of samples and i indexes a sample.

After the original data set is obtained, the text is cleaned with NLTK, a powerful natural language processing library for Python, and the text similarity between the tweets sent by each user is computed with Gensim, an open-source third-party Python toolkit for unsupervised learning of latent topic vector representations from raw unstructured text. Gensim supports various topic model algorithms, including TF-IDF, LSA, LDA, and word2vec, supports streaming training, and provides APIs for common tasks such as similarity calculation and information retrieval. After program processing, the invention extracts 16-dimensional features in total, as follows:

the proportion of tweets that mention (@) other users;

the average number of emoticons used per tweet;

the average number of stop words per tweet;

eight dimensions in total for the average counts of 8 symbols: "#", "," -, ","; ","! "," (",") ";

the average number of URLs per message sent;

the average length of original tweets;

the average length of forwarded tweets;

the proportion of forwarded tweets;

the average similarity between tweets.
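A few of the per-account averages listed above can be sketched as follows. This is only an illustration under stated assumptions: the function name `extract_features`, the URL pattern, and the symbol set `SYMBOLS` are hypothetical choices, and the sketch does not use NLTK or Gensim or compute the stop-word, emoticon, forwarding, or similarity features.

```python
import re

# Illustrative symbol set (an assumption, not the invention's exact 8 symbols).
SYMBOLS = ["#", "-", ";", "!", "(", ")"]

URL_PATTERN = re.compile(r"https?://\S+")


def extract_features(tweets):
    """Return a dict of per-account averages from a non-empty list of tweet strings."""
    n = len(tweets)
    features = {
        # Share of tweets that mention another user with "@".
        "mention_ratio": sum(1 for t in tweets if "@" in t) / n,
        # Average number of URLs per tweet.
        "avg_urls": sum(len(URL_PATTERN.findall(t)) for t in tweets) / n,
        # Average tweet length in characters.
        "avg_length": sum(len(t) for t in tweets) / n,
    }
    # One feature dimension per symbol: its average count per tweet.
    for symbol in SYMBOLS:
        features["avg_" + symbol] = sum(t.count(symbol) for t in tweets) / n
    return features


if __name__ == "__main__":
    print(extract_features(["hi @bob!", "see http://a.b (now)"]))
```

Each account then contributes one row of such averages to the feature matrix.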

After the 16-dimensional features are obtained, each feature dimension of the samples is normalized with the following formula:

$$x_{i,l} \leftarrow \frac{x_{i,l} - l_{\min}}{l_{\max} - l_{\min}}$$

In the above formula, l denotes a feature dimension of sample i in the feature matrix and ranges from 0 to 15, x_{i,l} is the value of sample i in dimension l, l_max is the maximum value of the samples in dimension l, and l_min is the minimum value of the samples in dimension l. The feature data set after normalization is represented as

$$X = \{x_1, x_2, \ldots, x_N\}$$
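The per-dimension normalization can be sketched in a few lines. The function name `min_max_normalize` is hypothetical, and mapping a constant column to 0.0 is an assumption of this sketch, since the text does not say how a dimension with equal maximum and minimum is handled.

```python
def min_max_normalize(matrix):
    """Scale every column of a row-major feature matrix to the range [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in matrix:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant column (span == 0) is mapped to 0.0.
            new_row.append((value - low) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


if __name__ == "__main__":
    print(min_max_normalize([[1, 10], [3, 10], [5, 10]]))
```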

Step 2, generating features through a deep generative model

The variational autoencoder, a form of deep generative model, is a generative network structure based on variational Bayes (VB) inference. Its structure is shown in fig. 3. Variational self-coding builds two probability density models with two neural networks: one, called the inference network, performs variational inference on the original input data and generates the variational probability distribution of the hidden variables; the other, called the generation network, restores an approximate probability distribution of the original data from the generated variational distribution of the hidden variables.

The original sample set obtained in step 1 is

$$X = \{x_1, x_2, \ldots, x_N\}$$

Each data sample x_i is a randomly generated, independent, continuously or discretely distributed variable. The observable variable X, a random vector in a high-dimensional space, serves as the input visible-layer variable, from which the hidden-layer variable Z is generated; the unobservable variable Z is a random vector in a relatively low-dimensional space. The generated data set is

$$X^* = \{x_1^*, x_2^*, \ldots, x_N^*\}$$

X^* denotes the sample set obtained by encoding and decoding the original sample set through variational self-coding. The variational self-coding generative model can be divided into two processes:

(1) Approximate inference of the posterior distribution of the hidden variable Z: the recognition model q_φ(z | x), i.e. the inference network; q_φ(z | x) represents the process of inferring z from a known x.

(2) Generation of the conditional distribution of the generated variable X^*: the conditional distribution p_θ(z)p_θ(x^* | z), i.e. the generation network.

The core of variational self-coding is to make q_φ(z | x) approximately equal to the true posterior distribution p_θ(z | x). The problem is thereby transformed: the optimization goal of both the inference network and the generation network is to maximize a variational lower bound function. In the following formulas, log denotes the base-10 logarithm, and θ (the generation network parameters) and φ (the inference network parameters) are the parameters of the networks, which are initialized before training and then updated by training. L(θ, φ; X) is the variational lower bound function, and the parameters θ and φ are solved from the known sample set X:

$$\{\theta, \phi\} = \arg\max_{\theta, \phi} L(\theta, \phi; X)$$

$$L(\theta, \phi; X) = \sum_{i=1}^{N} \left( -D_{KL}\left(q_\phi(z^i \mid x_i) \,\|\, p_\theta(z^i)\right) + \mathbb{E}_{q_\phi(z^i \mid x_i)}\left[\log p_\theta(x_i^* \mid z^i)\right] \right)$$

$$z^i = \mu^i + \varepsilon^i \cdot \delta^i$$

In the above equations, argmax denotes taking the parameters that maximize the variational lower bound function L, x_i^* denotes the generated data corresponding to sample i, zⁱ denotes the hidden variable corresponding to sample i, μⁱ denotes the mean of sample i in the inference network, and δⁱ denotes the variance of sample i in the inference network. The reparameterization method is adopted for sampling: to sample Z, an auxiliary parameter ε is introduced, which is drawn at random from the standard normal distribution N(0, 1); εⁱ denotes the randomly sampled value used when sample i is mapped to the hidden-layer variable zⁱ. With the auxiliary parameter, the relation between the hidden variable Z and the mean and variance changes from a sampling operation to a numerical calculation, so the optimization can directly use stochastic gradient descent. The conditional distribution p_θ(x_i^* | zⁱ) obeys a Bernoulli or Gaussian distribution and represents the process of inferring x from a known z for sample i in the generation network; it can be calculated directly from the probability density function. Every term of the variational lower bound can then be calculated directly, the parameters θ and φ of all visible and hidden units are updated through training, the model structure is finally determined by θ and φ, and the corresponding data can be generated from the input data. The variational self-coding encoder and decoder are visualized in fig. 4: the encoder takes 16 input features, the hidden variable has 4 dimensions, the decoder outputs 16 features, and the training batch is 2000.
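The sampling step z = μ + ε·δ and the two terms of the variational lower bound can be illustrated for a single sample without any neural network. The sketch below makes several assumptions that the text does not state: δ is treated as the scale (standard deviation) of a diagonal Gaussian, the prior p_θ(z) is a standard normal, the decoder is Gaussian with a fixed scale `sigma`, and natural logarithms are used because the closed-form KL term is expressed in them. All function names are hypothetical.

```python
import math
import random


def reparameterize(mu, delta, rng):
    """Draw z = mu + eps * delta with one eps ~ N(0, 1) per latent dimension."""
    return [m + rng.gauss(0.0, 1.0) * d for m, d in zip(mu, delta)]


def kl_to_standard_normal(mu, delta):
    """Closed-form KL(N(mu, diag(delta^2)) || N(0, I)); delta must be positive."""
    return 0.5 * sum(m * m + d * d - 1.0 - math.log(d * d)
                     for m, d in zip(mu, delta))


def gaussian_log_likelihood(x, x_star, sigma=1.0):
    """Log-density of x under a Gaussian decoder with mean x_star (assumed form)."""
    var = sigma * sigma
    return -0.5 * sum((a - b) ** 2 / var + math.log(2.0 * math.pi * var)
                      for a, b in zip(x, x_star))


def lower_bound(x, x_star, mu, delta):
    """Single-sample estimate of the bound: reconstruction term minus KL term."""
    return gaussian_log_likelihood(x, x_star) - kl_to_standard_normal(mu, delta)


if __name__ == "__main__":
    rng = random.Random(0)
    z = reparameterize([0.0] * 4, [1.0] * 4, rng)  # 4 latent dimensions
    print(len(z), lower_bound([0.5], [0.5], [0.0], [1.0]))
```

Because ε is sampled outside the network, z is a deterministic function of μ and δ, which is what allows gradients to flow through the sampling step during stochastic gradient descent.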

Step 3, fusing the original features with the generated features and detecting social robots by using an anomaly detection method

In step 2 the original sample features are encoded and decoded through variational self-coding. After decoding, the features of normal samples are close to their initial features, whereas the features of abnormal samples differ substantially from theirs. The original feature matrix X obtained in step 1 and the decoded new feature matrix X^* are fused to obtain the matrix Y, which serves as the input to the anomaly detection part.
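The text does not specify the fusion operation. One plausible reading, shown here purely as an assumption, is row-wise concatenation, which would turn each 16-dimensional sample and its 16-dimensional reconstruction into one 32-dimensional row. The function name `fuse_features` is hypothetical.

```python
def fuse_features(original, decoded):
    """Concatenate each original row with its decoded counterpart (assumed fusion)."""
    if len(original) != len(decoded):
        raise ValueError("both matrices must contain the same number of samples")
    return [list(x) + list(x_star) for x, x_star in zip(original, decoded)]


if __name__ == "__main__":
    print(fuse_features([[0.1, 0.9]], [[0.2, 0.8]]))
```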

The anomaly detection method selected by the invention is the k-nearest-neighbor (KNN) algorithm. Given a training data set and a new input instance, the KNN algorithm finds the k instances in the training data set that are nearest to the new instance; if the majority of these k instances belong to a certain class, the input instance is assigned to that class. The specific steps are as follows:

1. A labeled sample data set (the training sample set) is obtained, which records the correspondence between each piece of data and its class.

2. After new, unlabeled data are input, each feature of the new data is compared with the corresponding feature of the data in the sample set.

1) The distance between the new data and each piece of data in the sample data set is calculated.

2) All distances are sorted in ascending order (a smaller distance means greater similarity).

3) The class labels of the first k sample data are taken (k is generally less than or equal to 20).

3. The class label that occurs most frequently among the k data is taken as the class of the new data.
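The steps above can be sketched as a small majority-vote classifier. The function name `knn_classify` and the toy data are hypothetical; the sketch uses `math.dist` from the Python standard library (Python 3.8 or later) for the distance and does not define how ties between classes are broken beyond what `Counter.most_common` returns.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=5):
    """Return the most frequent label among the k training points nearest to query."""
    # Step 2.1: distance from the query to every labeled sample.
    pairs = [(math.dist(point, query), label)
             for point, label in zip(train_points, train_labels)]
    # Step 2.2: sort by distance, smallest (most similar) first.
    pairs.sort(key=lambda pair: pair[0])
    # Step 2.3: keep the labels of the first k samples.
    nearest_labels = [label for _, label in pairs[:k]]
    # Step 3: majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    points = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
    labels = ["human", "human", "human", "bot", "bot"]
    print(knn_classify(points, labels, [0.2, 0.2], k=3))
```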

The invention uses the common Euclidean distance as the distance metric. The distance between two points in the multidimensional space is calculated as follows:

$$d(y_i, y_j) = \sqrt{\sum_{n} \left(y_i^{(n)} - y_j^{(n)}\right)^2}$$

In the formula, d(y_i, y_j) denotes the Euclidean distance between sample i and sample j, y_i^{(n)} denotes the n-th dimension feature value of sample i, and y_j^{(n)} denotes the n-th dimension feature value of sample j, with 1 ≤ i ≤ N and 1 ≤ j ≤ N.
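The Euclidean distance can also drive an unsupervised anomaly score. The sketch below scores a sample by its mean distance to its k nearest reference samples, so that isolated samples receive high scores. Using the mean distance as the score is an assumption about how averaging over the K neighbors is applied, and the function name `mean_knn_distance` is hypothetical.

```python
import math


def mean_knn_distance(reference, query, k=5):
    """Mean Euclidean distance from query to its k nearest reference points."""
    # If query itself is in reference, its zero distance is counted as a neighbor.
    distances = sorted(math.dist(point, query) for point in reference)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    reference = [[0, 0], [0, 1], [1, 0], [1, 1]]
    print(mean_knn_distance(reference, [0, 0], k=2))    # close to the cluster
    print(mean_knn_distance(reference, [10, 10], k=2))  # far from the cluster
```

A threshold on this score, or a ranking by it, would then separate suspected robots from normal users.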

In addition, the classification decision in a classification problem is usually majority voting, which picks the label with the most votes, whereas in a regression problem it is usually the average of the labels of the K nearest neighbors; the invention also selects the average. In the experiments of the invention, K = 5 gives the better effect. Training time and AUC are used as the evaluation indexes, where AUC is the area under the ROC (receiver operating characteristic) curve. AUC equals the proportion, among all positive-negative sample pairs, of the pairs in which the positive sample is ranked ahead of the negative sample, i.e. a probability value. The AUC calculation formula is as follows:

$$AUC = \frac{\sum_{i \in \text{positive}} rank_i - \frac{M(M+1)}{2}}{M \times P}$$

In the above formula, M denotes the number of positive samples and P denotes the number of negative samples. AUC is computed by sorting the probability values from large to small and assigning rank N to the sample with the largest probability value, where N is the number of samples, rank N-1 to the sample with the second largest value, and so on. The ranks of all positive samples are then summed, and M(M+1)/2, the rank sum the M positive samples would have if they had the lowest scores, is subtracted. The result is the number of sample pairs in which the positive sample scores higher than the negative sample. Finally, this number is divided by M × P. The experimental results of the invention are as follows:

Method	AUC	time
VAE-KNN	0.9287	0.1157
VAE-Mean_KNN	0.941	0.1077
VAE-Media_knn	0.9351	0.1396
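The rank-based AUC computation described above can be sketched as follows. The function name `auc_from_ranks` and the toy scores are hypothetical and are not the experimental data; the sketch assumes labels of 1 for positive and 0 for negative samples and assumes all scores are distinct, since tied scores would need averaged ranks.

```python
def auc_from_ranks(scores, labels):
    """AUC via the rank-sum formula; labels are 1 (positive) or 0 (negative)."""
    # Ascending order, so the highest score receives rank N and the lowest rank 1.
    order = sorted(range(len(scores)), key=lambda index: scores[index])
    rank_sum = 0
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            rank_sum += rank
    m = sum(labels)            # number of positive samples (M)
    p = len(labels) - m        # number of negative samples (P)
    # Subtract the rank sum the positives would have if they scored lowest.
    return (rank_sum - m * (m + 1) / 2) / (m * p)


if __name__ == "__main__":
    print(auc_from_ranks([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

In the toy call, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.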

Claims

1. A social robot detection method based on the combination of variational self-coding and K-nearest neighbors, characterized by comprising the following steps:

step 1, data acquisition and preprocessing, wherein the acquired original data are processed by a program to obtain an original feature matrix;

step 2, generating features through a deep generative model;

step 3, after the original features and the generated features are subjected to feature fusion, detecting the social robot by using an anomaly detection method;

the method specifically comprises the following steps: firstly, an original social robot data set is obtained and processed to obtain a feature matrix expressed as

$$X = \{x_1, x_2, \ldots, x_N\}$$

where i indexes a sample; the features extracted are as follows:

the proportion of tweets that mention (@) other users;

the average number of emoticons used per tweet;

the average number of stop words per tweet;

eight dimensions in total for the average counts of 8 symbols: "#", "," -, ","; ","! "," (",") ";

the average number of URLs per message sent;

the average length of original tweets;

the average length of forwarded tweets;

the proportion of forwarded tweets;

the average similarity between tweets;

after the features are obtained, each feature dimension of the samples is normalized with the following formula:

$$x_{i,l} \leftarrow \frac{x_{i,l} - l_{\min}}{l_{\max} - l_{\min}}$$

in the above formula, l denotes a feature dimension of sample i in the feature matrix, x_{i,l} is the value of sample i in dimension l, l_max is the maximum value of the sample data in dimension l, l_min is the minimum value of the sample data in dimension l, and the feature data set after normalization is represented as

$$X = \{x_1, x_2, \ldots, x_N\}$$

a variational auto-encoder (VAE) is used as the deep generative model, with this set X as the sample set;

each data sample x_i is a randomly generated, independent, continuously or discretely distributed variable; the observable variable X serves as the input visible-layer variable, from which the hidden-layer variable Z is generated, and the generated data set is

$$X^* = \{x_1^*, x_2^*, \ldots, x_N^*\}$$

X^* denotes the sample set obtained by encoding and decoding the original sample set through variational self-coding, wherein the variational self-coding generative model is divided into two processes:

(1) approximate inference of the posterior distribution of the hidden variable Z: the recognition model q_φ(z | x), i.e. the inference network; q_φ(z | x) represents the process of inferring z from a known x;

(2) generation of the conditional distribution of the generated variable X^*: the conditional distribution p_θ(z)p_θ(x^* | z), i.e. the generation network;

the optimization targets of the inference network and the generation network are both to maximize the variational lower bound function; log in the formulas denotes the base-10 logarithm; θ, the generation network parameters, and φ, the inference network parameters, are the parameters of the networks, which are initialized before training and then updated by training; L(θ, φ; X) is the variational lower bound function, and the parameters θ and φ are solved from the known sample set X:

$$\{\theta, \phi\} = \arg\max_{\theta, \phi} L(\theta, \phi; X)$$

$$L(\theta, \phi; X) = \sum_{i=1}^{N} \left( -D_{KL}\left(q_\phi(z^i \mid x_i) \,\|\, p_\theta(z^i)\right) + \mathbb{E}_{q_\phi(z^i \mid x_i)}\left[\log p_\theta(x_i^* \mid z^i)\right] \right)$$

$$z^i = \mu^i + \varepsilon^i \cdot \delta^i$$

argmax denotes taking the parameters that maximize the variational lower bound function L, x_i^* denotes the generated data corresponding to sample i, zⁱ denotes the hidden variable corresponding to sample i, μⁱ denotes the mean of sample i in the inference network, and δⁱ denotes the variance of sample i in the inference network; an auxiliary parameter ε is introduced, which is drawn at random from the standard normal distribution N(0, 1), and εⁱ denotes the randomly sampled value used when sample i is mapped to the hidden-layer variable zⁱ; with the auxiliary parameter, the relation between the hidden variable Z and the mean and variance changes from a sampling operation to a numerical calculation, and the optimization directly uses stochastic gradient descent; the conditional distribution p_θ(x_i^* | zⁱ) obeys a Bernoulli or Gaussian distribution, represents the process of inferring x from a known z for sample i in the generation network, and is calculated directly from the probability density function; each term of the variational lower bound is then calculated directly, and the parameters θ and φ of all visible and hidden units are updated through training, so that the model structure is finally determined.

2. The social robot detection method based on the combination of variational self-coding and K neighbors of claim 1, wherein the variational self-coding encodes and decodes the original sample features, and the original feature matrix X and the decoded new feature matrix X^* are fused to obtain the matrix Y;

the non-parametric, lazy k-nearest-neighbor algorithm is well suited to anomaly detection; given a training data set and a new input instance, the algorithm finds the k instances in the training data set that are nearest to the new instance, and if the majority of these k instances belong to a certain class, the input instance is assigned to that class; the value of k is greater than 0 and less than or equal to 20; the distance between two instances is measured by the Euclidean distance, and the distance between two points in the multidimensional space is calculated as follows:

$$d(y_i, y_j) = \sqrt{\sum_{n} \left(y_i^{(n)} - y_j^{(n)}\right)^2}$$

where y_i^{(n)} denotes the n-th dimension feature value of sample i, and y_j^{(n)} denotes the n-th dimension feature value of sample j, with 1 ≤ i ≤ N and 1 ≤ j ≤ N;

in addition, the decision takes the average value of the labels of the K nearest neighbors, as is usual in a regression problem, namely the average over the K nearest instances found by the distance calculation; AUC is taken as an evaluation index, where AUC is the area under the ROC (receiver operating characteristic) curve and equals the proportion, among all positive-negative sample pairs, of the pairs in which the positive sample is ranked ahead of the negative sample, that is, a probability value; the AUC calculation formula is as follows:

$$AUC = \frac{\sum_{i \in \text{positive}} rank_i - \frac{M(M+1)}{2}}{M \times P}$$

in the above formula, M denotes the number of positive samples and P denotes the number of negative samples; AUC is computed by sorting the probability values from large to small and assigning rank N to the sample with the largest probability value, where N is the number of samples, rank N-1 to the sample with the second largest value, and so on; the ranks of all positive samples are then summed, and M(M+1)/2, the rank sum the M positive samples would have if they had the lowest scores, is subtracted; the result is the number of sample pairs in which the positive sample scores higher than the negative sample; finally, this number is divided by M × P.