CN111274403A

CN111274403A - Network spoofing detection method

Info

Publication number: CN111274403A
Application number: CN202010083486.5A
Authority: CN
Inventors: 赵泽华; 高旻; 罗逢吉; 王润生; 钟将; 吴映波; 熊庆宇; 文俊浩
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2020-02-09
Filing date: 2020-02-09
Publication date: 2020-06-12
Anticipated expiration: 2040-02-09
Also published as: CN111274403B

Abstract

The invention discloses a network spoofing detection method comprising the following steps: 1) representing the social information of a user U_p according to a social network G(U, R); 2) accurately representing each word in W_h, in particular sparse words; 3) fusing the social information representation and the text information representation obtained in steps 1) and 2), and assigning the correct text label to S_h on that basis. The social network is represented as G(U, R), where the node set U is the set of users and the edge set R is the set of follow relationships among users. The set of unlabeled short texts published by all users in G is denoted S. A set of label categories is defined, where k is the number of label categories, and each text in S can be assigned exactly one category label. W_h denotes the set of words in the text S_h ∈ S published by user U_p, where l is the length of the short text S_h; each short text in S belongs to exactly one user. The method achieves better detection performance and can effectively improve the accuracy of deceptive-text detection.

Description

Network spoofing detection method

Technical Field

The invention relates to the technical field of network security, in particular to a network spoofing detection method.

Background

Network spoofing detection algorithms in the prior art can generally be divided into two categories: text-based detection methods and multimodal detection methods. Text-based methods use only the text itself for detection and are the most common kind of network spoofing detection. Most recent text-based methods use word embedding to learn text representations; their main advantage is that low-dimensional word representations greatly improve detection efficiency. However, word embedding techniques still have limitations when dealing with "intentionally obfuscated words" in network spoofing text. An "intentionally obfuscated word" is a word that the publisher of the spoofing message has deliberately altered in order to evade system detection. Such purposely created vocabulary is very sparse in the corpus. Conventional word embedding techniques typically either remove sparse words during preprocessing or represent them with random vectors. Because these sparse "intentionally obfuscated words" often play a crucial role in spoofing detection, conventional word embedding techniques do not work well for network spoofing detection tasks.

In addition, in some cases detecting network spoofing events requires context information rather than analysis of the text alone. For example, "you should wear a skirt" can be regarded as a normal expression when sent to a girl, but it is probably a deceptive text when sent to a boy. To address this problem, researchers have proposed multimodal detection methods. Such methods incorporate additional information about the text to be detected (e.g., the gender, age, and educational status of the user) to improve detection performance. However, this additional information is usually private user data and is difficult to obtain. Therefore, many multimodal detection methods instead use meta-information (e.g., a picture attached to a tweet) as the additional information. In contrast to private user data, meta-information can easily be obtained through the API provided by the social platform. However, some studies have shown that most meta-information is not effective for spoofing detection.

Disclosure of Invention

The present invention aims to solve the above problems by providing a more effective network spoofing detection method.

In order to achieve the purpose, the invention adopts the following technical scheme: a method of network spoofing detection comprising the steps of:

1) representing the social information of a user U_p according to a social network G(U, R);

2) accurately representing each word in W_h, in particular sparse words;

3) fusing the social information representation and the text information representation obtained in steps 1) and 2), and assigning the correct text label to S_h on that basis;

wherein the social network is represented as G(U, R), the node set U is the set of users, the edge set R is the set of follow relationships among users, the set of unlabeled short texts published by all users in G is denoted S, and W_h denotes the set of words in the text S_h ∈ S published by user U_p, where l is the length of the short text S_h; each short text in S belongs to exactly one user.

Further, in step 1), a group of node sequences is obtained from the social network G by random walk, and these sequences contain the social information of the users. In the random walk process, a node U_i is randomly sampled and taken as the root node; a neighbor node U_j of U_i is then randomly sampled, and U_j is taken as the new root node; the process is repeated until a preset sampling-count threshold is reached. After the random walk process is completed, the social information representation of each user node is learned using the Skip-Gram algorithm.

Further, in step 2), for a given corpus, the word embedding method based on word co-occurrence vector similarity generates, in order: (1) a co-occurrence matrix C, where d is the size of the corpus; (2) a list of the sparse words in the corpus, where words whose frequency is lower than a predefined threshold b are taken as sparse words; and (3) a second-order neighborhood list for each sparse word O_i, each list having its own length. A similarity matrix S is then obtained from the co-occurrence matrix C.

The calculation formula is shown as formula (1), which is based on the standard Euclidean distance, with f = max fr_i;

The word embedding method based on word co-occurrence vector similarity is built on an autoencoder model comprising an encoder and a decoder. The encoder is a two-layer fully connected neural network whose input is the co-occurrence matrix C and whose output is the text representation matrix L. The decoder is another two-layer fully connected neural network, and it must reconstruct both the co-occurrence matrix C and the similarity matrix S: it generates a reconstructed co-occurrence matrix and, according to formula (2), a reconstructed similarity matrix.

The training process of the word embedding method based on word co-occurrence vector similarity is expressed as follows:

Further, in step 3), the social information representation and the text information representation learned in steps 1) and 2) are fused, and a fusion vector is generated by concatenation, wherein the dimension of the fusion vector is the maximum text length plus one (the extra row representing the user U_p); the first row of the fusion vector is the social information representation of the user who published the short text, and each subsequent row is the representation of the word at the corresponding position in the sentence; if the sentence is shorter than the maximum sentence length, zero vectors are appended at the end of the fusion vector; the fusion vector is input into a deceptive-text classifier to identify whether the short text is deceptive text.

Further, the deceptive-text classifier is based on a neural network structure and uses a bidirectional long short-term memory network as the detector; the size of the Input Layer in the classifier is max l + 1; the Output Layer has k neurons representing the k text classes and uses the Softmax function as the activation function; and the drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.

Compared with the prior art, the invention has the following beneficial effects: the method has better detection performance, and the social information can effectively improve the accuracy of the deception text detection.

Drawings

Fig. 1 is an example of intentionally obfuscated words.

Fig. 2 is a diagram of a word embedding technique architecture based on similarity.

Fig. 3 is a general architecture diagram of a network spoofing detection method.

Fig. 4 is an example of the deceptive-text classifier architecture.

FIG. 5 is a diagram illustrating the effect of different graph embedding methods on the results of the spoofing detection.

FIG. 6 is a graph of the impact of social information and fusion vector dimensions on detection efficiency.

FIG. 7 is a diagram illustrating the impact of different word embedding techniques on the detection of spoofing.

Detailed Description

Referring to Fig. 3, in the network spoofing detection method, the network spoofing detection task is treated as a text classification task. The social network is represented as G(U, R), where the node set U is the set of users and the edge set R is the set of follow relationships between users. The set of unlabeled short texts published by all users in G is denoted S. A set of label categories is defined, where k is the number of label categories. Each text in S can be assigned one and only one category label. W_h denotes the set of words in the text S_h ∈ S published by user U_p, where l is the length of the short text S_h. Each short text in S belongs to exactly one user.

Given the above, the network spoofing detection task amounts to identifying whether S_h is a deceptive text. The method is divided into three steps: 1) representing the social information of a user U_p according to the social network G(U, R); 2) accurately representing each word in W_h, in particular sparse words; 3) fusing the social information representation and the text information representation obtained in steps 1) and 2), and assigning the correct text label to S_h on that basis, thereby identifying whether it is a deceptive text.

In step 1), a group of node sequences is first obtained from the social network G by random walk; these sequences contain the social information of the users. In the random walk process, a node U_i is randomly sampled and taken as the root node, and then a neighbor node U_j of U_i is randomly sampled. U_j is then taken as the root node, and the process is repeated until a preset sampling-count threshold is reached. After the random walk process is completed, the social information representation of each user (node) is learned using the Skip-Gram algorithm.
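The sampling procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `random_walks` and the parameters `num_walks` and `walk_length` are hypothetical stand-ins for the "preset sampling-count threshold", and neighbors are assumed to be sampled uniformly.

```python
import random

def random_walks(adjacency, num_walks, walk_length, seed=0):
    """Sample node sequences from a follow graph by uniform random walks.

    adjacency: dict mapping each user to a list of neighbouring users.
    num_walks, walk_length: hypothetical knobs standing in for the
    preset sampling-count threshold mentioned in the text.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(num_walks):
        root = rng.choice(nodes)            # randomly sample a node U_i as the root
        walk = [root]
        while len(walk) < walk_length:
            neighbours = adjacency.get(root, [])
            if not neighbours:              # isolated node: end this walk early
                break
            root = rng.choice(neighbours)   # sample a neighbour U_j and make it the new root
            walk.append(root)
        walks.append(walk)
    return walks

# Toy follow graph (made-up users, for illustration only).
graph = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1", "u2"], "u4": []}
walks = random_walks(graph, num_walks=5, walk_length=4)
```

The resulting sequences play the role of "sentences" for a Skip-Gram model; one possible choice is gensim's `Word2Vec` with `sg=1`, although the text does not say which implementation was used.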

In step 2), as shown in Fig. 1, "fcukk" and "fxxk" are "intentionally obfuscated words": they are sparse and carry little context information, and conventional processing treats them as two unrelated words. In fact, they express the same meaning, and their first-order neighborhoods are highly similar, namely { fcukk: you, locking, shit, killing } and { fxxk: you, locking, shifting }. In order to effectively represent "intentionally obfuscated words" that share the same semantics, the application designs a similarity-based word embedding technique, so that an "intentionally obfuscated word" and its corresponding original word have similar representations.

For a given corpus, the similarity-based word embedding technique generates, in order: (1) a co-occurrence matrix C, where d is the size of the corpus; (2) a list of the sparse words in the corpus, where words whose frequency is below a predefined threshold b are considered sparse words; and (3) a second-order neighborhood list for each sparse word O_i, each list having its own length. A similarity matrix S can then be obtained from the co-occurrence matrix C.

The calculation formula is shown as formula (1), which is based on the standard Euclidean distance, with f = max fr_i.
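As a rough illustration of the first two artifacts, the sketch below builds a co-occurrence matrix and a sparse-word list for a toy corpus, then compares two co-occurrence rows with the standard Euclidean distance. Several things here are assumptions rather than details from the text: co-occurrence is counted at sentence level, the helper names are hypothetical, and the distance is applied to co-occurrence rows. Formula (1) itself is not reproduced.

```python
from collections import Counter
import math

def build_cooccurrence(sentences):
    """Count, for every pair of distinct words, how often they share a sentence."""
    freq = Counter(w for s in sentences for w in s)
    vocab = sorted(freq)
    index = {w: i for i, w in enumerate(vocab)}
    d = len(vocab)
    C = [[0] * d for _ in range(d)]
    for s in sentences:
        for a in s:
            for b in s:
                if a != b:
                    C[index[a]][index[b]] += 1
    return vocab, index, C, freq

def sparse_words(freq, b):
    """Words whose corpus frequency is below the threshold b."""
    return sorted(w for w, n in freq.items() if n < b)

# Toy corpus: "id1ot" stands in for an obfuscated spelling of "idiot".
sentences = [
    ["you", "are", "an", "idiot"],
    ["you", "are", "an", "id1ot"],
    ["you", "are", "kind"],
]
vocab, index, C, freq = build_cooccurrence(sentences)
sparse = sparse_words(freq, b=2)

# Euclidean distance between co-occurrence rows of two words.
d_same = math.dist(C[index["idiot"]], C[index["id1ot"]])
d_diff = math.dist(C[index["idiot"]], C[index["kind"]])
```

In this toy corpus the obfuscated and original spellings have identical co-occurrence rows, so their distance is zero, while an unrelated sparse word is farther away. That is the intuition the similarity matrix is meant to capture.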

The similarity-based word embedding technique is built entirely on an autoencoder model. As shown in Fig. 2, the autoencoder includes an encoder and a decoder. The technique uses a two-layer fully connected neural network as the encoder; the input of the encoder is the co-occurrence matrix C, and its output is the text representation matrix L. Unlike a conventional autoencoder, which reconstructs only the co-occurrence matrix, the decoder of the similarity-based word embedding technique must reconstruct not only the co-occurrence matrix C but also the similarity matrix S. The technique uses another two-layer fully connected neural network as its decoder, which generates a reconstructed co-occurrence matrix and, according to formula (2), a reconstructed similarity matrix.

The training process of the similarity-based word embedding technique can be expressed as:

In step 3), in the deceptive-text detector module, the social information representation and the text information representation learned in steps 1) and 2) are fused, and a fusion vector is generated by concatenation. The dimension of the fusion vector is the maximum text length plus one (the extra row representing the user U_p). The first row of the fusion vector is the social information representation of the user who posted the short text; each subsequent row is the representation of the word at the corresponding position in the sentence. If the sentence is shorter than the maximum sentence length, zero vectors are appended at the end of the fusion vector.
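The row layout of the fusion vector can be illustrated with a small NumPy sketch. It assumes, purely for illustration, that the social embedding and the word embeddings share the same width so they can be stacked as rows; the function name `fuse` and the toy values are hypothetical.

```python
import numpy as np

def fuse(user_vec, word_vecs, max_len):
    """Stack the user's social embedding on top of the word embeddings
    and zero-pad the result to (max_len + 1) rows."""
    user_vec = np.asarray(user_vec, dtype=float)
    dim = user_vec.shape[0]
    words = np.asarray(word_vecs, dtype=float).reshape(-1, dim)
    if len(words) > max_len:
        raise ValueError("sentence is longer than max_len")
    fused = np.zeros((max_len + 1, dim))
    fused[0] = user_vec                  # first row: social information of the publisher
    fused[1:1 + len(words)] = words      # following rows: words in sentence order
    return fused                         # remaining rows stay zero (padding)

user = [0.1, 0.2, 0.3]                         # toy social embedding of U_p
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # toy embeddings of a two-word text
fused = fuse(user, words, max_len=4)
```

With a maximum text length of 4, the result has 5 rows: one for the user, two for the words, and two rows of zero padding.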

The fusion vector is then input into a deceptive-text classifier to identify whether the short text is deceptive text. As shown in Fig. 4, the deceptive-text classifier is based on a neural network structure. We use a bidirectional long short-term memory network (BLSTM) as the detector. Unlike a recurrent neural network (RNN), which can only acquire unidirectional context information, a BLSTM can capture bidirectional context information. In the classifier, the size of the Input Layer is max l + 1, and the Output Layer has k neurons representing the k text classes. The Softmax function is used as the activation function. The drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.

The validity of the detection method is verified by combining the real data.

We used four real data sets to validate SocialBully (the network spoofing detection method of the present invention); the data set information is shown in Table 1. These data sets were generated by random sampling from Twitter data. For the users who appear in the data sets, we obtained their follow relationships through the API provided by Twitter on September 10, 2018.

Each short text in the data sets is normalized by data preprocessing, as follows: (1) remove special characters, including ! @ # $ % & ( ) - + = | ' [ ] { } ; , < > / ?; (2) remove web links; (3) remove stop words according to the stop-word list in the Natural Language Toolkit (NLTK); and (4) reduce the influence of word inflection through stemming.
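A minimal sketch of steps (1) to (3) is given below. It makes a few assumptions for the sake of a self-contained example: `STOP_WORDS` is a tiny stand-in rather than NLTK's actual stop-word list, links are stripped before special characters (removing punctuation first would break a URL into fragments), and stemming is left out.

```python
import re

# Stand-in stop-word list for illustration only; the text uses NLTK's list.
STOP_WORDS = {"a", "an", "the", "is", "to", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPECIAL_RE = re.compile(r"[!@#$%&()\-+=|'\[\]{};,<>/?]")

def preprocess(text):
    """Strip links and special characters, then drop stop words."""
    text = URL_RE.sub(" ", text)        # step (2), done first so URLs are removed whole
    text = SPECIAL_RE.sub(" ", text)    # step (1)
    tokens = text.split()
    return [t for t in tokens if t.lower() not in STOP_WORDS]   # step (3)

tokens = preprocess("This is SO rude!! see http://example.com/x @someone #mean")
```

Step (4) could then be applied to each token with a stemmer such as NLTK's `PorterStemmer`; the text does not say which stemmer was used.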

Table 1: Information on the four Twitter data sets

We use TensorFlow and Keras to implement SimWord and the deception detector (cyberbullying detector), respectively. Categorical cross-entropy is used as the loss function of the deception detector. Each set of experiments was repeated 5 times, and the averaged results were taken as the final results. In each experiment, we randomly selected 80% of the data as the training data set and used the rest as the test data set. We chose Precision, Recall, and F1 score as evaluation metrics. The predefined threshold b for sparse words is 2.
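For reference, the evaluation metrics and the 80/20 split can be sketched in plain Python as follows. The sketch computes the metrics for a single positive class; how the metrics are aggregated over the k classes is not stated in the text, and the helper names and toy labels are hypothetical.

```python
import random

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def split_80_20(items, seed):
    """Randomly split items into 80% training and 20% test data."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
p, r, f1 = precision_recall_f1(y_true, y_pred)
train, held_out = split_80_20(range(10), seed=0)
```

Repeating the split with five different seeds and averaging the resulting scores would mirror the protocol described above.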

All experiments were performed on a personal computer running macOS Mojave with a 2.5 GHz Intel Core i7 processor and 16 GB of memory.

1.1 comparison of the network spoofing detection method of the present invention and the conventional network spoofing detection method

We compare the network spoofing detection method of the present invention with two recent network spoofing detection methods: (1) the network spoofing detection framework proposed by Agrawal and Awekar, which uses deep-learning-based detectors and word embedding techniques to detect spoofing text but does not take social information into account; and (2) the network spoofing detection method proposed by Pitsilis et al. [32], which uses a long short-term memory network (LSTM) as the detector and combines it with the user's tendency toward racism or sexism.

Table 2 shows the F1 values of the different spoofing detection methods on the four data sets, where bold font indicates the best detection result on each data set. The experimental results show that the detection performance of SocialBully is superior to that of the other two detection methods in most cases. Compared with the method of Agrawal and Awekar, SocialBully has an F1 value greater than 0.9 on all four data sets, significantly higher than their method. In addition, we replaced the BLSTM detector with an LSTM and compared the result with the method of Pitsilis et al. The network spoofing detection method of the present invention is superior to their method in many cases, as shown in Table 3. This further shows that fusing social information with the text information learned by SimWord (the similarity-based word embedding technique) can effectively improve the detection of deceptive text.

Table 2: F1 values of different network spoofing detection methods on the four data sets

1.2 impact analysis of social information on network spoofing detection

To compare the impact of different graph embedding techniques on spoofing detection, we selected the following three representative graph embedding techniques for experiments: (1) Node2vec, which samples the neighborhood of the root node using random walks; in the experiments, the parameter q is set to 1.5 and 0.5, corresponding to a breadth-first sampling strategy and a depth-first sampling strategy, respectively; (2) GraRep, which learns latent node representations by matrix factorization and can capture the global structural information of the graph; and (3) DeepWalk.

Fig. 5 illustrates the impact of different graph embedding techniques on the network spoofing detection result. The detection method with the added social information is represented as blue, and the detection method without the added social information is represented as yellow. It can be found that the detection method with the added social information is superior to the detection method without the added social information in three indexes. This demonstrates that adding social information is beneficial for improving network spoofing detection performance.

Among all the detection methods that add social information, the DeepWalk-based detection method performs best on the four data sets, followed in order by the Node2vec-based (q = 0.5), Node2vec-based (q = 1.5), and GraRep-based detection methods. Note that DeepWalk may be considered a special case of Node2vec (i.e., Node2vec with the parameters p and q both set to 1).

In addition, we further investigated the impact of social information on the efficiency of spoofing detection. Fig. 6 shows the F1 value and the execution time for spoofing detection using SocialBully on data set 3. There is an approximately linear relationship between the execution time and the dimension of the fusion vector. This is because a larger fusion vector increases the number of parameters in the deceptive-text detector, which requires more computation time. We also found that SocialBully can achieve better detection results at a lower dimension than the case where no social information is added. For example, without social information, the fusion vector dimension is 19 and the F1 value reaches 0.85 after 23.3 seconds of execution; after social information is added, the dimension of the fusion vector is reduced to 7 and the execution time is reduced by 8.6%. These results further verify that user social information has a positive impact on the efficiency of network spoofing detection.

1.3 validation of SimWord (the similarity-based word embedding technique) in the spoofing detection task

We compared the effect of SimWord and two other representative word embedding techniques, Word2vec and GloVe, on spoofing detection. The former trains quickly and is one of the most common word embedding techniques at present; the latter uses a global log-bilinear regression model for unsupervised word representation learning.

Fig. 7 compares the impact of different word embedding techniques on spoofing detection. SimWord is superior to the other methods on all three metrics, especially the F1 value. However, this advantage is not evident on data sets 3 and 4. These two data sets contain more spoofing samples, so the detector can capture more latent features, and the text information representation therefore plays a smaller role in network spoofing detection. In the other two data sets, the latent features that the detector can capture are limited, so the text information representation has a more significant effect on the spoofing detection task. This also suggests that SimWord may alleviate the class imbalance problem to some extent.

1.4 contrasting the effects of different spoofed text detectors in the task of spoofing detection

We validated the effectiveness of the BLSTM-based detector through comparative experiments with two classes (six detectors) of classical network spoofing detectors: (1) conventional machine-learning-based detectors, including Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Naive Bayes (NB); and (2) deep-learning-based detectors, including the long short-term memory network (LSTM) and the bidirectional long short-term memory network combined with an attention mechanism (BLSTM_att). The LSTM can only capture one-way text information; the attention mechanism in BLSTM_att can better retain dependencies between states at different time steps.

Table 3 compares the F1 values of the seven detectors on the four data sets, with bold font indicating the best result achieved on each data set. The experimental results show that the deep-learning-based detectors perform better than the conventional machine-learning-based detectors. With a deep-learning-based detector, almost all F1 values were above 0.9, whereas the best result of a conventional machine-learning-based detector was only 0.8748. On data set 1 and data set 4, the BLSTM-based detector adopted by SocialBully works best, whereas the BLSTM_att-based detector performed slightly better than the BLSTM-based detector on the other two data sets.

Table 3: F1 values of different detectors on the four data sets

The experimental results on the four real data sets show that the network spoofing detection method has better detection performance than the other network spoofing detection methods. The experimental results also verify that social information can effectively improve the accuracy of deceptive-text detection.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment, and all technical solutions belonging to the principle of the present invention belong to the protection scope of the present invention. Modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention.

Claims

1. A method for network spoofing detection, comprising the steps of:

wherein the social network is represented as G(U, R), the node set U is the set of users, the edge set R is the set of follow relationships among users, the set of unlabeled short texts published by all users in G is denoted S, and let

2. The network spoofing detection method of claim 1, wherein in step 1),

a group of node sequences is obtained from the social network G by random walk, the sequences containing the social information of users; in the random walk process, a node U_i is randomly sampled and taken as the root node, a neighbor node U_j of U_i is then randomly sampled, U_j is next taken as the root node, and the process is repeated until a preset sampling-count threshold is reached;

after the random walk process is completed, the social information representation of each user node is learned using the Skip-Gram algorithm.

3. The network spoofing detection method of claim 1, wherein in step 2),

for a given corpus, a word embedding method based on word co-occurrence vector similarity sequentially generates: (1) a co-occurrence matrix

Where d is the size of the corpus; (2) sparse vocabulary list in corpus

wherein

The calculation formula is shown as formula (1);

wherein ,

representing a standard Euclidean distance; f = max fr_i;

And generating a reconstruction similarity matrix according to the formula (2)

4. The network spoofing detection method of claim 1, wherein in step 3),

the social information representation and the text information representation learned in steps 1) and 2) are fused, and a fusion vector is generated by concatenation, wherein the dimension of the fusion vector is the maximum text length plus one (the extra row representing the user U_p); the first row of the fusion vector is the social information representation of the user who published the short text, and each subsequent row is the representation of the word at the corresponding position in the sentence; if the sentence is shorter than the maximum sentence length, zero vectors are appended at the end of the fusion vector;

the fusion vector is input into a deceptive-text classifier to identify whether the short text is deceptive text.

5. The network spoofing detection method of claim 4, wherein the deceptive-text classifier is based on a neural network structure and uses a bidirectional long short-term memory network as the detector; the size of the Input Layer in the classifier is max l + 1; the Output Layer has k neurons representing the k text classes and uses the Softmax function as the activation function; and the drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.