CN113158076A - Social robot detection method based on variational self-coding and K-nearest neighbor combination - Google Patents

Publication number
CN113158076A
Authority
CN
China
Prior art keywords
sample
samples
data
network
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110364341.7A
Other languages
Chinese (zh)
Other versions
CN113158076B (en)
Inventor
王秀娟
郑倩倩
郑康锋
随艺
曹思玮
石雨桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Technology
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology, Beijing University of Posts and Telecommunications filed Critical Beijing University of Technology
Priority to CN202110364341.7A priority Critical patent/CN113158076B/en
Publication of CN113158076A publication Critical patent/CN113158076A/en
Application granted granted Critical
Publication of CN113158076B publication Critical patent/CN113158076B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A social robot detection method based on the combination of variational self-coding and K-nearest neighbors, belonging to the technical field of anomaly detection. The method obtains public social robot data from the network, extracts features through preprocessing, trains a variational self-coding model on the data, and uses it to encode and decode the features. After decoding, the features of normal samples remain close to their initial features, whereas the features of abnormal samples differ greatly from their initial features. The original features and the decoded features are then fused, and anomaly detection is performed with a K-nearest-neighbor anomaly detection method. The method takes into account that, in a large social network environment, abnormal user groups are smaller than normal user groups, so collecting abnormal users is relatively troublesome. The method overcomes the high labeling cost and the imbalance between positive and negative samples in existing social network robot detection methods, and achieves efficient detection of social network robot users by reducing the participation of abnormal samples in model training.

Description

Social robot detection method based on variational self-coding and K-nearest neighbor combination
Technical Field
The invention belongs to the technical field of anomaly detection, and particularly relates to social robot detection based on variational self-coding.
Background
With the popularization and development of the internet, a large amount of real online user behavior data has become available for studying human behavior. As of December 2020, the number of internet users in China had reached 989 million and Twitter had 192 million daily active users; as of September 2020, Weibo had 511 million monthly active users and an average of 224 million daily active users. Such a huge user base generates terabytes of data every day, recording the rich online behavior of a vast number of ordinary users. Social media has become an indispensable way for people to acquire and share information. In summary, social media sites such as Twitter and Weibo bring a new opportunity: to study, from objective behavior data, whether a user's behavior deviates from the normal social pattern, and thereby to detect users who endanger network security.
Most people are now willing to express emotions, record their lives, and voice opinions on mass social media platforms. The social network as a whole is becoming more complex and diverse, and new problems keep emerging. Social robots (automated programs that simulate the behavior of real, normal users in a social network) have been created for various purposes. They were originally created to serve people and improve quality of life, but their development has escaped human control: a social robot can disguise itself as an independent entity, create fake accounts, and steal user privacy, send spam, spread malicious links, or launch DDoS attacks. This harms innocent users and has made social robots a malignant growth that damages the health of the social network. According to a report by the U.S. Securities and Exchange Commission, more than 23 million active Twitter accounts in 2014 were actually social robots, which have become an important force in producing and spreading content on social media. A 2020 Bad Bot Report on the state of automated network traffic, issued by a network security service provider, indicates that in 2019 malicious machine traffic accounted for 24.1% of all traffic, good machine traffic for 13.1%, and human traffic, up 1.1% from the previous year, for 62.8%, as shown in fig. 1. The robots mentioned in the report often operate as botnets that hide the origin of their traffic through anonymous proxies and other identity-hiding techniques while disguising themselves as legitimate humans. This property is what makes them difficult to control, so the problem of detecting such robots is of great significance.
For example, detecting social robots that influence political elections by distorting online opinion, manipulate stock markets, or push anti-vaccine conspiracy theories that lead to health epidemics is a challenging and meaningful task.
A social robot is a program that mimics human social behavior. Early work on detecting bad users in social networks focused mainly on paid posters (the "water army"), spam users, and zombie followers. With the appearance of machine users, all sectors became aware of the negative effects of malicious social robots; because machine users appeared late, research on them started late and is relatively scarce. Researchers have classified social network users into human users, normal machine users, and malicious machine users. Normal machine users are unlikely to engage in malicious behavior, and their behavioral characteristics are closer to those of normal users and clearly different from those of malicious robots. Normal robots can therefore be treated as normal users and malicious machine users as machine users, so that detecting malicious social robots becomes a classification problem: a malicious machine user is a positive example in the training set, and any other user is a normal user and a negative example. Most studies treat machine-user detection as a classification problem, using for example Random Forest (RF), AdaBoost, linear Regression (LR), and Decision Tree (DT) models as classifiers. However, classification-based methods must be trained in advance, depend heavily on the accuracy of the training data and its labels, and lack an effective solution to class imbalance. Anomaly detection research has recently produced remarkable results and is better suited to detecting abnormal users in social networks.
Disclosure of Invention
In a large social network environment, abnormal user groups are small relative to normal user groups, so collecting abnormal users is relatively troublesome. To reduce the participation of abnormal samples in model training, the invention provides a social robot detection method based on the combination of variational self-coding (Variational Auto-Encoder, VAE) and anomaly detection. A variational self-coding model is trained on the data and then used to encode and decode it. After decoding, the features of normal samples are close to their initial features, while those of abnormal samples differ greatly from their initial features. The original features and the decoded features are fused, and an anomaly detection method is then applied.
Drawings
FIG. 1 shows the proportions of machine traffic;
FIG. 2 is a flow chart of the present invention;
FIG. 3 shows the variational self-coding structure;
FIG. 4 is a visualization of the variational self-coding encoder and decoder.
Detailed Description
As shown in fig. 2, the invention provides a social robot detection method based on the combination of variational self-coding and anomaly detection. The method comprises the following steps: step 1, data acquisition and preprocessing, in which original text data acquired from the network is processed by a program to obtain an original feature matrix; step 2, generating features through variational self-coding, a deep generative model; step 3, fusing the original features with the generated features and detecting social robots with an anomaly detection method.
Step 1, data acquisition and preprocessing, namely processing original text data acquired from the network by using a program to obtain an original feature matrix
Very little social robot data is publicly available. The invention uses the public, labeled CLEF2019 data set, which contains 2880 training accounts and 1240 validation accounts with 100 tweets per account. Every account is labeled as a robot or a normal user (normal users also carry a gender label), giving 2060 normal users and 2060 machine users in total. The social robot data used by the invention is represented as
$$X = \{x_i\}_{i=1}^{N}$$
where N = 4120 is the number of samples and i denotes a sample.
After the original data set is obtained, the text is cleaned with NLTK, a natural language processing library for Python. Gensim, an open-source third-party Python toolkit, is used to compute the similarity between the texts sent by each user. Gensim learns hidden-layer topic vector representations of text from raw unstructured text in an unsupervised way. It supports multiple topic model algorithms, including TF-IDF, LSA, LDA, and word2vec, supports streaming training, and provides APIs for common tasks such as similarity calculation and information retrieval. After processing by the program, the invention extracts 16-dimensional features in total, as follows:
the proportion of tweets that @-mention other users;
the average number of emoticons used per tweet;
the average number of stop words per tweet;
the average counts of 8 symbols, eight dimensions in total: "#", "," -, ","; ","! "," (",") ";
the average number of URLs per sent message;
the average length of original tweets;
the average length of forwarded tweets;
the proportion of forwarded tweets;
the average similarity between tweets.
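A few of the per-account averages listed above can be sketched with the Python standard library alone. This is only an illustration, not the program used by the invention: the function name `account_features`, the "RT " prefix used to recognise forwarded tweets, the regular expressions, and the short default symbol tuple are all assumptions, and the stop-word, emoticon, and similarity features (which the description computes with NLTK and Gensim) are left out.

```python
import re

# Hypothetical patterns for illustration; the source does not specify them.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def account_features(tweets, symbols=("#", "!", "(", ")")):
    """Return per-account averages for a non-empty list of tweet strings."""
    if not tweets:
        raise ValueError("tweets must not be empty")
    n = len(tweets)
    # Assumption: a forwarded tweet starts with the prefix "RT ".
    retweets = [t for t in tweets if t.startswith("RT ")]
    originals = [t for t in tweets if not t.startswith("RT ")]

    def mean_len(items):
        # Average character length, or 0.0 when the group is empty.
        return sum(len(t) for t in items) / len(items) if items else 0.0

    feats = {
        "mention_ratio": sum(1 for t in tweets if MENTION_RE.search(t)) / n,
        "avg_urls": sum(len(URL_RE.findall(t)) for t in tweets) / n,
        "avg_len_original": mean_len(originals),
        "avg_len_retweet": mean_len(retweets),
        "retweet_ratio": len(retweets) / n,
    }
    # One dimension per counted symbol (only a subset of symbols by default).
    for s in symbols:
        feats[f"avg_count_{s}"] = sum(t.count(s) for t in tweets) / n
    return feats
```

Calling `account_features` on the 100 tweets of one account would give one row of the feature matrix, once the remaining dimensions are added.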
After the 16-dimensional features are obtained, each feature dimension of the samples is normalized. The normalization formula is as follows:
$$x'_{i,l} = \frac{x_{i,l} - l_{\min}}{l_{\max} - l_{\min}}$$
In the above formula, l is the feature dimension of sample i in the feature matrix and ranges from 0 to 15, $x_{i,l}$ is the value of sample i in dimension l, $l_{\max}$ is the maximum value of the samples in dimension l, and $l_{\min}$ is the minimum value of the samples in dimension l. The normalized feature data set is represented as
$$X = \{x_i\}_{i=1}^{N}$$
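The min-max normalization described above can be sketched as follows. The function name `min_max_normalize` and the list-of-lists representation of the feature matrix are assumptions for illustration; mapping a constant column to 0.0 is also an assumption, since the description does not say how a dimension with equal maximum and minimum is handled.

```python
def min_max_normalize(rows):
    """Scale each column of a list-of-lists feature matrix to [0, 1]."""
    cols = list(zip(*rows))          # transpose: one tuple per feature dimension
    lo = [min(c) for c in cols]      # l_min for every dimension
    hi = [max(c) for c in cols]      # l_max for every dimension
    return [
        # Assumption: a constant column (l_max == l_min) is mapped to 0.0.
        [(v - a) / (b - a) if b > a else 0.0 for v, a, b in zip(row, lo, hi)]
        for row in rows
    ]
```

For example, the column values 1, 2, 3 become 0.0, 0.5, 1.0.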
Step 2, generating features through a deep generative model
The variational autoencoder, a form of deep generative model, is a generative network structure based on Variational Bayes (VB) inference. Its structure is shown in fig. 3. Variational self-coding uses two neural networks to establish two probability density distribution models: one performs variational inference on the original input data and produces the variational probability distribution of the hidden variable, and is called the inference network; the other restores an approximate probability distribution of the original data from the generated variational probability distribution of the hidden variable, and is called the generation network.
The sample set obtained in step 1 is
$$X = \{x_i\}_{i=1}^{N}$$
Each data sample $x_i$ is an independently and randomly generated variable with a continuous or discrete distribution. The observable variable X is a random vector in a high-dimensional space and serves as the input visible-layer variable; the hidden-layer variable Z is then generated. The unobservable variable Z is a random vector in a relatively low-dimensional space. The generated data set is
$$X^{*} = \{x_i^{*}\}_{i=1}^{N}$$
$X^{*}$ denotes the sample set obtained by encoding and decoding the original sample set through variational self-coding. The variational self-coding generative model can be divided into two processes:
(1) Approximate inference of the posterior distribution of the hidden variable Z: the recognition model $q_\phi(z \mid x)$, i.e. the inference network; $q_\phi(z \mid x)$ represents the process of inferring z from a known x.
(2) Generation of the conditional distribution of the variable $X^{*}$: the distribution $p_\theta(z)\,p_\theta(x^{*} \mid z)$, i.e. the generation network.
The core of variational self-coding is to make $q_\phi(z \mid x)$ approximately equal to the true posterior distribution $p_\theta(z \mid x)$. The problem is thereby transformed: the optimization goal of both the inference network and the generation network is to maximize a variational lower bound function. In the following formulas, log is the logarithm with base 10, and θ (generation network parameters) and φ (inference network parameters) are the parameters of the networks; the networks are initialized before training, and the parameters are then updated by training. L(θ, φ; X) is the variational lower bound function, and the parameters θ and φ are solved from the known sample set X:
$$(\theta^{*}, \phi^{*}) = \arg\max_{\theta, \phi} L(\theta, \phi; X)$$
$$L(\theta, \phi; X) = \sum_{i=1}^{N} \left[ -D_{KL}\big(q_\phi(z \mid x_i) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x_i)}\big[\log p_\theta(x_i^{*} \mid z)\big] \right]$$
$$z_i = \mu_i + \varepsilon_i \cdot \delta_i$$
In the above formulas, argmax selects the parameters that maximize the variational lower bound function L, $x_i^{*}$ represents the data generated for sample i, $z_i$ denotes the hidden variable corresponding to sample i, $\mu_i$ is the mean of sample i in the inference network, and $\delta_i$ is the variance of sample i in the inference network. To complete the sampling of Z, an auxiliary parameter ε is introduced, which is randomly sampled from the standard normal distribution N(0, 1); $\varepsilon_i$ is the random sample used when sample i is mapped to its hidden-layer variable $z_i$. With the auxiliary parameter, the relation between the hidden variable Z and the mean and variance changes from a sampling computation to a numerical computation, so the optimization can directly use stochastic gradient descent. The conditional distribution
$$p_\theta(x_i^{*} \mid z_i)$$
obeys a Bernoulli or Gaussian distribution and represents the process of inferring x from a known z for sample i in the generation network; it is computed directly from its probability density function. Each term of the variational lower bound can then be computed directly, and the parameters θ and φ of all visible and hidden units are updated through training. The model structure is finally determined by θ and φ, and the model can generate corresponding data from input data. The variational self-coding encoder and decoder can be visualized as in fig. 4: the encoder takes 16 input features, the hidden variable has 4 dimensions, the decoder outputs 16 features, and the training batch is 2000.
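The sampling step and the two terms of the lower bound can be illustrated with a minimal, untrained sketch. Several choices here are assumptions rather than details from the description: the prior is taken to be a standard normal N(0, I), the decoder is taken to be Gaussian with unit variance, `sigma` is treated as a standard deviation in z = mu + sigma * eps, and natural logarithms are used (a base-10 logarithm would only rescale the terms by a constant factor). All function names are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), drawn independently per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), using natural logs."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

def gaussian_log_likelihood(x, x_hat, var=1.0):
    """log N(x; x_hat, var * I): reconstruction term for an assumed Gaussian decoder."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * (d * math.log(2.0 * math.pi * var) + sq / var)

def lower_bound(x, x_hat, mu, sigma):
    """Single-sample estimate of the variational lower bound for one input x."""
    return gaussian_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, sigma)
```

In a full model, `mu` and `sigma` would be produced by the inference network from a 16-dimensional input, `x_hat` by the generation network from the 4-dimensional `z`, and the lower bound summed over a batch would be maximized by stochastic gradient descent.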
Step 3, after feature fusion of the original features and the generated features, detecting the social robot by using an anomaly detection method
The original sample features are encoded and decoded by the variational self-coding of step 2. After decoding, the features of normal samples are close to their initial features, while the features of abnormal samples differ greatly from their initial features. The original feature matrix X obtained in step 1 and the decoded new feature matrix $X^{*}$ are fused to obtain
$$Y = \{y_i\}_{i=1}^{N}$$
which is the input to the anomaly detection part.
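The description does not state how the two matrices are fused. One simple possibility, shown here purely as an assumption, is row-wise concatenation, so that each 16-dimensional original row and its 16-dimensional decoded row become one 32-dimensional row. The function name `fuse_features` is hypothetical.

```python
def fuse_features(original, decoded):
    """Concatenate each original feature row with its decoded counterpart.

    Assumption: fusion is plain concatenation; the source only says "fused".
    """
    if len(original) != len(decoded):
        raise ValueError("original and decoded must have the same number of rows")
    return [list(x) + list(x_star) for x, x_star in zip(original, decoded)]
```

With this choice, a sample whose decoded features differ strongly from its original features ends up far from the samples whose two halves agree.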
The anomaly detection method selected by the invention is the k-nearest-neighbor (KNN) algorithm. Given a training data set and a new input instance, the KNN algorithm finds the k instances in the training data set that are nearest to the new instance; if the majority of these k instances belong to a certain class, the input instance is assigned to that class. The specific steps are as follows:
1. Obtain a labeled sample data set (the training sample set), which records the class of each piece of data.
2. When new unlabeled data is input, compare each feature of the new data with the corresponding feature of the data in the sample set:
1) calculate the distance between the new data and each piece of data in the sample data set;
2) sort all the distances in ascending order (a smaller distance means greater similarity);
3) take the classification labels of the first k sample data (k is generally less than or equal to 20).
3. Take the classification label that occurs most often among the k data as the class of the new data.
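The steps above can be sketched as a small majority-vote classifier. The function name `knn_classify` and the data layout (lists of feature vectors and a parallel list of labels) are assumptions for illustration; `math.dist` computes the Euclidean distance and requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, query, k=5):
    """Label `query` by majority vote among its k nearest training samples."""
    # Step 2.1: distance from the query to every labeled sample.
    pairs = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    # Step 2.2: sort by distance, smallest (most similar) first.
    pairs.sort(key=lambda p: p[0])
    # Step 2.3: labels of the first k samples.
    top_labels = [label for _, label in pairs[:k]]
    # Step 3: the most frequent label among them.
    return Counter(top_labels).most_common(1)[0][0]
```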
The distance measure of the invention is the common Euclidean distance. The distance between two points in a multidimensional space is calculated as follows:
$$d(y_i, y_j) = \sqrt{\sum_{n} \left(y_i^{n} - y_j^{n}\right)^{2}}$$
In the formula, $d(y_i, y_j)$ is the Euclidean distance between sample i and sample j, $y_i^{n}$ is the n-th dimension feature value of sample i, and $y_j^{n}$ is the n-th dimension feature value of sample j, with 1 ≤ i ≤ N and 1 ≤ j ≤ N.
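Combined with a nearest-neighbor search, the Euclidean distance above yields a simple anomaly score. The sketch below scores each query by the mean distance to its k nearest reference samples; reading the method's "average" in this way is an assumption, as are the function name `knn_mean_distance_scores` and the choice of reference set.

```python
import math

def knn_mean_distance_scores(reference, queries, k=5):
    """Anomaly score per query: mean Euclidean distance to its k nearest references."""
    scores = []
    for q in queries:
        # Distances to every reference sample, smallest first.
        dists = sorted(math.dist(q, r) for r in reference)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores
```

A larger score means the sample lies farther from its neighbors and is more likely to be abnormal; replacing the mean with the median or the largest of the k distances gives closely related variants.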
In addition, the classification decision in a classification problem is usually a majority vote that picks the label with the most votes; in a regression problem it is usually the average of the labels of the K nearest neighbors, and the invention also uses the average. In the experiments of the invention, K = 5 gave the better results. Training time and AUC are used as evaluation indexes; AUC is the area under the ROC (receiver operating characteristic) curve. AUC equals the proportion, among all positive-negative sample pairs, of pairs in which the positive sample is ranked ahead of the negative sample, i.e. a probability value. The AUC calculation formula is as follows.
$$AUC = \frac{\sum_{i \in \text{positive}} rank_i - \frac{M(M+1)}{2}}{M \times P}$$
In the above formula, M is the number of positive samples and P is the number of negative samples. To compute AUC, the probability values are sorted from large to small; the sample with the largest probability value receives rank N, where N is the number of samples, the sample with the second-largest value receives rank N-1, and so on. The ranks of all positive samples are then added, and M(M+1)/2, the rank sum the M positive samples would have if they held the M lowest ranks, is subtracted. The result is the number of pairs in which the positive sample scores higher than the negative sample. Finally, this number is divided by M×P. The experimental results of the invention are as follows:
Method           AUC      Time
VAE-KNN          0.9287   0.1157
VAE-Mean_KNN     0.941    0.1077
VAE-Media_knn    0.9351   0.1396
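The rank-based AUC computation described before the results can be sketched as follows. The function name `auc_from_ranks` and the 1/0 label encoding are assumptions; tied scores receive their average rank, a common convention that the description does not mention.

```python
def auc_from_ranks(scores, labels):
    """AUC = (sum of positive ranks - M(M+1)/2) / (M * P); labels are 1 or 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based rank, averaged over the tie
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    m = sum(1 for y in labels if y == 1)   # number of positive samples
    p = len(labels) - m                    # number of negative samples
    if m == 0 or p == 0:
        raise ValueError("need at least one positive and one negative sample")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - m * (m + 1) / 2.0) / (m * p)
```

The lowest score gets rank 1 and the highest gets rank N, matching the ranking in the text.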

Claims (2)

1. A social robot detection method based on the combination of variational self-coding and K-nearest neighbors, characterized by comprising the following steps:
step 1, data acquisition and preprocessing, wherein the acquired original data are processed by a program to obtain an original feature matrix;
step 2, generating features through a deep generative model;
step 3, after feature fusion of the original features and the generated features, detecting the social robot by using an anomaly detection method;
the method specifically comprises the following: first, an original social robot data set is obtained and processed to obtain a feature matrix expressed as
$$X = \{x_i\}_{i=1}^{N}$$
where i is a sample; the extracted features are as follows:
the proportion of tweets that @-mention other users;
the average number of emoticons used per tweet;
the average number of stop words per tweet;
the average counts of 8 symbols, eight dimensions in total: "#", "," -, ","; ","! "," (",") ";
the average number of URLs per sent message;
the average length of original tweets;
the average length of forwarded tweets;
the proportion of forwarded tweets;
the average similarity between tweets;
after the features are obtained, each feature dimension of the samples is normalized, and the normalization formula is as follows:
$$x'_{i,l} = \frac{x_{i,l} - l_{\min}}{l_{\max} - l_{\min}}$$
in the above formula, l is the feature dimension of sample i in the feature matrix, $x_{i,l}$ is the value of sample i in dimension l, $l_{\max}$ is the maximum value of the sample data in dimension l, $l_{\min}$ is the minimum value of the sample data in dimension l, and the normalized feature data set is represented as
$$X = \{x_i\}_{i=1}^{N}$$
a variational auto-encoder (VAE) is used as the deep generative model; the sample set is
$$X = \{x_i\}_{i=1}^{N}$$
each data sample $x_i$ is an independently and randomly generated variable with a continuous or discrete distribution; the observable variable X serves as the input visible-layer variable, the hidden-layer variable Z is then generated, and the generated data set is
$$X^{*} = \{x_i^{*}\}_{i=1}^{N}$$
$X^{*}$ denotes the sample set obtained after the original sample set is encoded and decoded through variational self-coding, wherein the variational self-coding generative model is divided into two processes:
(1) approximate inference of the posterior distribution of the hidden variable Z: the recognition model $q_\phi(z \mid x)$, i.e. the inference network; $q_\phi(z \mid x)$ represents the process of inferring z from a known x;
(2) generation of the conditional distribution of the variable $X^{*}$: the distribution $p_\theta(z)\,p_\theta(x^{*} \mid z)$, i.e. the generation network;
the optimization goal of both the inference network and the generation network is to maximize the variational lower bound function; log in the formulas is the logarithm with base 10; θ, the generation network parameters, and φ, the inference network parameters, are the parameters of the networks; the networks are initialized before training, and the parameters are then updated by training; L(θ, φ; X) is the variational lower bound function, and the parameters θ and φ are solved from the known sample set X:
$$(\theta^{*}, \phi^{*}) = \arg\max_{\theta, \phi} L(\theta, \phi; X)$$
$$L(\theta, \phi; X) = \sum_{i=1}^{N} \left[ -D_{KL}\big(q_\phi(z \mid x_i) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x_i)}\big[\log p_\theta(x_i^{*} \mid z)\big] \right]$$
$$z_i = \mu_i + \varepsilon_i \cdot \delta_i$$
argmax selects the parameters that maximize the variational lower bound function L, $x_i^{*}$ represents the data generated for sample i, $z_i$ represents the hidden variable corresponding to sample i, $\mu_i$ represents the mean of sample i in the inference network, and $\delta_i$ represents the variance of sample i in the inference network; an auxiliary parameter ε is introduced, which is randomly sampled from the standard normal distribution N(0, 1), and $\varepsilon_i$ is the random sample used when sample i is mapped to its hidden-layer variable $z_i$; with the auxiliary parameter, the relation between the hidden variable Z and the mean and variance changes from a sampling computation to a numerical computation, and the optimization directly uses stochastic gradient descent; the conditional distribution
$$p_\theta(x_i^{*} \mid z_i)$$
obeys a Bernoulli or Gaussian distribution and represents the process of inferring x from a known z for sample i in the generation network; it is computed directly from its probability density function; each term of the variational lower bound is then computed directly, and the parameters θ and φ of all visible units and hidden units are updated through training, so that the model structure is finally determined.
2. The social robot detection method based on the combination of variational self-coding and K neighbors of claim 1, wherein the variational self-coding encodes and decodes original sample features, and the original sample features are decodedFeature matrix X and decoded new feature X*The matrixes are fused to obtain a matrix
Figure FDA0003006805360000025
The abnormal detection of a non-parametric and inert k nearest neighbor algorithm is very suitable for abnormal detection, for a given training data set, for a new input example, the algorithm needs to find k examples which are nearest to the example in the training data set, if the majority of the k examples belong to a certain class, the input example is divided into the class, the value of k is greater than 0, and the upper limit is less than or equal to 20; the distance metric for both instances is calculated as Euclidean distance, where the distance between two points in the multidimensional space is calculated as follows:
Figure FDA0003006805360000026
in the formula, d(y_i, y_j) represents the Euclidean distance between sample i and sample j,
Figure FDA0003006805360000027
represents the nth-dimension feature value of sample i,
Figure FDA0003006805360000028
represents the nth-dimension feature value of sample j, where 1 ≤ i ≤ N and 1 ≤ j ≤ N;
in addition, as distinct from the classification problem, in the regression problem the output is the average of the labels of the K nearest neighbors, that is, the average over the K nearest instances found by distance. The AUC is taken as the evaluation index: the AUC is the area under the ROC (receiver operating characteristic) curve, and equals the proportion, among all positive-negative sample pairs, of pairs in which the positive sample is ranked ahead of the negative sample, that is, a probability value. The AUC calculation formula is as follows:
Figure FDA0003006805360000031
in the above formula, M represents the number of positive samples and P represents the number of negative samples. The AUC is computed by sorting the probability values from large to small and assigning rank N to the sample with the largest probability value, where N is the number of samples, rank N-1 to the sample with the second-largest value, and so on; the ranks of all positive samples are then summed, and the rank sum for the case in which the positive samples hold the M smallest scores is subtracted; the result is the number of pairs, among all samples, in which the score of the positive sample is greater than the score of the negative sample; finally this number is divided by M × P.
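The steps of claim 2 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The fusion of X and X* is assumed to be column-wise concatenation, since the fused matrix itself appears only in a figure. Tied scores receive the average of their ranks, a detail the claim does not address. The rank-based AUC follows the standard form (sum of positive ranks − M(M+1)/2) / (M × P), which matches the verbal description above. All function names are hypothetical.

```python
import math
from collections import Counter


def fuse(X, X_star):
    """Assumed fusion: concatenate each original row with its decoded row."""
    return [list(a) + list(b) for a, b in zip(X, X_star)]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train_X, train_y, query, k=3):
    """Return (majority label, mean label) over the k nearest training rows."""
    if not 0 < k <= 20:
        raise ValueError("k must satisfy 0 < k <= 20")
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    labels = [train_y[i] for i in order[:k]]
    majority = Counter(labels).most_common(1)[0][0]
    return majority, sum(labels) / len(labels)


def auc_by_rank(labels, scores):
    """AUC from ranks: smallest score has rank 1, largest has rank N."""
    n = len(labels)
    m = sum(1 for y in labels if y == 1)  # number of positive samples (M)
    p = n - m                             # number of negative samples (P)
    if m == 0 or p == 0:
        raise ValueError("need at least one positive and one negative sample")
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1    # mean of ranks i+1 .. j+1 for a tie group
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    positive_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (positive_rank_sum - m * (m + 1) / 2) / (m * p)


# Toy data, invented for illustration: label 1 = social robot, 0 = normal account.
train_X = fuse([[0.0], [0.1], [1.0], [1.1]], [[0.0], [0.0], [1.0], [1.0]])
train_y = [0, 0, 1, 1]
label, score = knn_predict(train_X, train_y, [0.9, 1.0], k=3)
```

Using the mean label of the k neighbors as a score gives `auc_by_rank` a ranking to evaluate even though the classifier itself outputs a hard majority vote.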
CN202110364341.7A 2021-04-05 2021-04-05 Social robot detection method based on variational self-coding and K-nearest neighbor combination Expired - Fee Related CN113158076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110364341.7A CN113158076B (en) 2021-04-05 2021-04-05 Social robot detection method based on variational self-coding and K-nearest neighbor combination

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110364341.7A CN113158076B (en) 2021-04-05 2021-04-05 Social robot detection method based on variational self-coding and K-nearest neighbor combination

Publications (2)

Publication Number Publication Date
CN113158076A true CN113158076A (en) 2021-07-23
CN113158076B CN113158076B (en) 2022-07-22

Family

ID=76888430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110364341.7A Expired - Fee Related CN113158076B (en) 2021-04-05 2021-04-05 Social robot detection method based on variational self-coding and K-nearest neighbor combination

Country Status (1)

Country Link
CN (1) CN113158076B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330370A (en) * 2022-03-17 2022-04-12 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682118A (en) * 2016-12-08 2017-05-17 华中科技大学 Social network site false fan detection method achieved on basis of network crawler by means of machine learning
CN111556016A (en) * 2020-03-25 2020-08-18 中国科学院信息工程研究所 Network flow abnormal behavior identification method based on automatic encoder
EP3719711A2 (en) * 2020-07-30 2020-10-07 Institutul Roman De Stiinta Si Tehnologie Method of detecting anomalous data, machine computing unit, computer program
CN111767472A (en) * 2020-07-08 2020-10-13 吉林大学 Method and system for detecting abnormal account of social network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682118A (en) * 2016-12-08 2017-05-17 华中科技大学 Social network site false fan detection method achieved on basis of network crawler by means of machine learning
CN111556016A (en) * 2020-03-25 2020-08-18 中国科学院信息工程研究所 Network flow abnormal behavior identification method based on automatic encoder
CN111767472A (en) * 2020-07-08 2020-10-13 吉林大学 Method and system for detecting abnormal account of social network
EP3719711A2 (en) * 2020-07-30 2020-10-07 Institutul Roman De Stiinta Si Tehnologie Method of detecting anomalous data, machine computing unit, computer program

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330370A (en) * 2022-03-17 2022-04-12 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence
CN114330370B (en) * 2022-03-17 2022-05-20 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence

Also Published As

Publication number Publication date
CN113158076B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
Alrubaian et al. A credibility analysis system for assessing information on twitter
Hu et al. Social spammer detection with sentiment information
Jacob Multi-objective genetic algorithm and CNN-based deep learning architectural scheme for effective spam detection
Ramalingaiah et al. Twitter bot detection using supervised machine learning
Lin et al. A graph convolutional encoder and decoder model for rumor detection
Qiu et al. An adaptive social spammer detection model with semi-supervised broad learning
Feng et al. A phishing webpage detection method based on stacked autoencoder and correlation coefficients
Tehlan et al. A spam detection mechamism in social media using soft computing
Abinaya et al. Spam detection on social media platforms
CN113158076B (en) Social robot detection method based on variational self-coding and K-nearest neighbor combination
Tian et al. Predicting rumor retweeting behavior of social media users in public emergencies
CN114218457B (en) False news detection method based on forwarding social media user characterization
CN106557983B (en) Microblog junk user detection method based on fuzzy multi-class SVM
Yang et al. A model for early rumor detection base on topic-derived domain compensation and multi-user association
Sattikar et al. A role of artificial intelligence techniques in security and privacy issues of social networking
Eckhardt et al. Convolutional Neural Networks and Long Short Term Memory for Phishing Email Classification
İş et al. A Profile Analysis of User Interaction in Social Media Using Deep Learning.
Arora et al. Significant machine learning and statistical concepts and their applications in social computing
CN113157993A (en) Network water army behavior early warning model based on time sequence graph polarization analysis
Mudda et al. Spatial-aware deep recommender system
Pranathi et al. Logistic regression based cyber harassment identification
Yang et al. Improving blog spam filters via machine learning
Guo Comparison of neural network and traditional classifiers for twitter sentiment analysis
Nagavi et al. Detection and Classification of Toxic Content for Social Media Platforms
Chowdhury Spam identification on Facebook, Twitter and Email using machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220722
