CN111274403A - Network spoofing detection method - Google Patents

Publication number
CN111274403A
Authority
CN
China
Prior art keywords
text
word
network
user
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010083486.5A
Other languages
Chinese (zh)
Other versions
CN111274403B (en)
Inventor
赵泽华
高旻
罗逢吉
王润生
钟将
吴映波
熊庆宇
文俊浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Application filed by Chongqing University
Priority to CN202010083486.5A
Publication of CN111274403A
Application granted
Publication of CN111274403B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a network spoofing detection method comprising the following steps: 1) representing the social information of a user U_p according to a social network G(U, R); 2) accurately representing each word in W_h, in particular sparse words; 3) fusing the social information representation and the text information representation obtained in step 1) and step 2), and on that basis assigning S_h the correct text label. The social network is represented as G(U, R), where the node set U is the set of users and the edge set R is the set of follow relationships between users. The set of unlabeled short texts published by all users in G is denoted S. Let
Figure DDA0002381185580000011
denote the set of label categories, where k is the number of label categories; each text in S is assigned exactly one category label. Let
Figure DDA0002381185580000012
denote the set of words in a text S_h ∈ S published by user U_p, where l is the length of the short text S_h; each short text in S belongs to exactly one user. The method achieves better detection performance and effectively improves the accuracy of spoofing text detection.

Description

Network spoofing detection method
Technical Field
The invention relates to the technical field of network security, in particular to a network spoofing detection method.
Background
Prior-art network spoofing detection algorithms generally fall into two categories: text-based detection methods and multimodal detection methods. Text-based methods use only textual information for detection and are the most common. Most recent text-based methods use word embedding to learn text representations; their main advantage is that low-dimensional word representations greatly improve detection efficiency. However, word embedding techniques are still limited in handling "intentionally obfuscated words" in network spoofing text, that is, words that the publisher of the spoofing message deliberately distorts in order to evade system detection. Such deliberately created vocabulary is very sparse in the corpus. Conventional word embedding techniques typically remove sparse words during preprocessing or represent them with random vectors. Because these sparse "intentionally obfuscated words" often play a crucial role in spoofing detection, conventional word embedding techniques do not work well for this task.
In addition, detecting a network spoofing event sometimes requires considering context rather than the text alone. For example, "you should wear a skirt" sent to a girl can be regarded as a normal expression, but sent to a boy it is very likely a spoofing text. To address this problem, researchers have proposed multimodal detection methods, which incorporate additional information associated with the text to be detected (e.g., gender, age, and educational status) to improve detection performance. However, such additional information is usually private user data and is difficult to obtain. Many multimodal methods therefore use meta-information (e.g., a picture attached to a tweet) as the additional information. Unlike private user data, meta-information can be obtained easily through the APIs provided by social platforms. However, some studies have shown that most meta-information is not effective for spoofing detection.
Disclosure of Invention
The present invention aims to solve the above problems by providing a more effective network spoofing detection method.
To achieve this purpose, the invention adopts the following technical solution: a network spoofing detection method comprising the following steps:
1) representing the social information of a user U_p according to a social network G(U, R);
2) accurately representing each word in W_h, in particular sparse words;
3) fusing the social information representation and the text information representation obtained in step 1) and step 2), and on that basis assigning S_h the correct text label;
where the social network is represented as G(U, R), the node set U is the set of users, and the edge set R is the set of follow relationships between users; the set of unlabeled short texts published by all users in G is denoted S; let
Figure BDA0002381185560000021
denote the set of label categories, where k is the number of label categories, and each text in S is assigned exactly one category label; let
Figure BDA0002381185560000022
denote the set of words in a text S_h ∈ S published by user U_p, where l is the length of the short text S_h; each short text in S belongs to exactly one user.
Further, in step 1), a group of node sequences is obtained from the social network G by random walk; these sequences carry the social information of the users. In the random walk, a node U_i is sampled at random and taken as the root node, then a neighbor node U_j of U_i is sampled at random, U_j is taken as the new root node, and the process is repeated until a preset threshold on the number of sampling steps is reached. After the random walk is complete, the social information representation of each user node is learned using the Skip-Gram algorithm.
Further, in step 2), for a given corpus, the word embedding method based on word co-occurrence vector similarity generates, in order: (1) a co-occurrence matrix
Figure BDA0002381185560000023
where d is the size of the corpus; (2) a list of the sparse words in the corpus
Figure BDA0002381185560000024
where words whose frequency is below a predefined threshold b are treated as sparse words; and (3) a second-order neighborhood list for each sparse word
Figure BDA0002381185560000025
where
Figure BDA0002381185560000026
is the length of the second-order neighborhood list of the sparse word O_i. The similarity matrix
Figure BDA0002381185560000027
is then obtained from the co-occurrence matrix C according to formula (1):
Figure BDA0002381185560000028
where
Figure BDA0002381185560000029
denotes the standard Euclidean distance and f = max fr_i;
the word embedding method based on word co-occurrence vector similarity is built on an autoencoder model comprising an encoder and a decoder. The encoder is a two-layer fully connected neural network whose input is the co-occurrence matrix C and whose output is the text representation matrix L. The decoder, which is another two-layer fully connected neural network, must reconstruct both the co-occurrence matrix C and the similarity matrix S: it generates the reconstructed co-occurrence matrix
Figure BDA00023811855600000210
and the reconstructed similarity matrix
Figure BDA00023811855600000211
according to formula (2):
Figure BDA00023811855600000212
The training process of the word embedding method based on word co-occurrence vector similarity is expressed as:
Figure BDA00023811855600000213
Further, in step 3), the social information representation and the text information representation learned in step 1) and step 2) are fused, and a fusion vector is generated by concatenation. The dimension of the fusion vector is the maximum text length plus one (the extra row representing the user U_p). The first row of the fusion vector is the social information representation of the user who published the short text, and each subsequent row is the representation of the word at the corresponding position in the sentence. If the sentence is shorter than the maximum sentence length, zero vectors are appended to the end of the fusion vector. The fusion vector is input into a spoofing text classifier to identify whether the short text is a spoofing text.
Further, the spoofing text classifier is based on a neural network structure and uses a bidirectional long short-term memory network as the detector. The size of the Input Layer in the classifier is max l + 1; the Output Layer has k neurons representing the k text classes. The Softmax function is used as the activation function, and the drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.
Compared with the prior art, the invention has the following beneficial effects: the method has better detection performance, and the social information can effectively improve the accuracy of the deception text detection.
Drawings
Fig. 1 is an example of intentionally obfuscated words.
Fig. 2 is a diagram of a word embedding technique architecture based on similarity.
Fig. 3 is a general architecture diagram of a network spoofing detection method.
Fig. 4 is an example of the spoofing text classifier architecture.
FIG. 5 is a diagram illustrating the effect of different graph embedding methods on the results of the spoofing detection.
FIG. 6 is a graph of the impact of social information and fusion vector dimensions on detection efficiency.
FIG. 7 is a diagram illustrating the impact of different word embedding techniques on the detection of spoofing.
Detailed Description
Referring to Fig. 3, the network spoofing detection method treats network spoofing detection as a text classification task. The social network is represented as G(U, R), where the node set U is the set of users and the edge set R is the set of follow relationships between users. The set of unlabeled short texts published by all users in G is denoted S. Let
Figure BDA0002381185560000031
denote the set of label categories, where k is the number of label categories. Each text in S is assigned exactly one category label. Let
Figure BDA0002381185560000032
denote the set of words in a text S_h ∈ S published by user U_p, where l is the length of the short text S_h. Each short text in S belongs to exactly one user.
It follows that the network spoofing detection task can be stated concretely as identifying whether S_h is a spoofing text. The method is divided into three steps: 1) representing the social information of a user U_p according to the social network G(U, R); 2) accurately representing each word in W_h, in particular sparse words; 3) fusing the social information representation and the text information representation obtained in step 1) and step 2), and on that basis assigning S_h the correct text label, thereby identifying whether it is a spoofing text.
In step 1), a group of node sequences is first obtained from the social network G by random walk; these sequences carry the social information of the users. In the random walk, a node U_i is sampled at random and taken as the root node, and then a neighbor node U_j of U_i is sampled at random. U_j is then taken as the new root node, and the process is repeated until a preset threshold on the number of sampling steps is reached. After the random walk is complete, the social information representation of each user (node) is learned using the Skip-Gram algorithm.
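The walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the helper names (`build_adjacency`, `random_walk`, `skipgram_pairs`), the toy user IDs, the window size, and the choice to treat follow edges as undirected are all hypothetical. The sketch stops at producing the (center, context) pairs that a Skip-Gram model would be trained on; the embedding training itself is omitted.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an adjacency list from (user, user) follow pairs.

    Assumption: follow edges are treated as undirected for walking.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return dict(adj)


def random_walk(adj, num_steps, rng):
    """Sample a random root, then hop to a random neighbor num_steps times."""
    root = rng.choice(sorted(adj))
    walk = [root]
    for _ in range(num_steps):
        neighbors = adj[root]
        if not neighbors:  # isolated node: nothing left to sample
            break
        root = rng.choice(neighbors)  # the sampled neighbor becomes the new root
        walk.append(root)
    return walk


def skipgram_pairs(walk, window):
    """Return the (center, context) pairs a Skip-Gram model would train on."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Hypothetical toy graph with four users.
edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u1"), ("u3", "u4")]
adj = build_adjacency(edges)
rng = random.Random(0)
walks = [random_walk(adj, num_steps=4, rng=rng) for _ in range(3)]
pairs = skipgram_pairs(walks[0], window=1)
```

Each walk plays the role of a "sentence" whose "words" are user nodes, which is why a word embedding algorithm such as Skip-Gram can be applied to it.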
In step 2), as shown in Fig. 1, "fcukk" and "fxxk" are "intentionally obfuscated words". They are sparse and carry little context information, and conventional processing treats them as two unrelated words. In fact, they express the same meaning and their first-order neighborhoods are highly similar, namely { fcukk: you, locking, shit, killing } and { fxxk: you, locking, shifting }. To represent "intentionally obfuscated words" with the same semantics effectively, this application designs a similarity-based word embedding technique so that an "intentionally obfuscated word" and its corresponding original word have similar representations.
For a given corpus, the similarity-based word embedding technique generates, in order: (1) a co-occurrence matrix
Figure BDA0002381185560000041
where d is the size of the corpus; (2) a list of the sparse words in the corpus
Figure BDA0002381185560000042
where words whose frequency is below a predefined threshold b are treated as sparse words; and (3) a second-order neighborhood list for each sparse word
Figure BDA0002381185560000043
where
Figure BDA0002381185560000044
is the length of the second-order neighborhood list of the sparse word O_i. The similarity matrix
Figure BDA0002381185560000045
is then obtained from the co-occurrence matrix C according to formula (1):
Figure BDA0002381185560000046
where
Figure BDA0002381185560000047
denotes the standard Euclidean distance and f = max fr_i.
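The three intermediate structures listed above can be illustrated with a small pure-Python sketch. Several details are assumptions made for illustration only: the context window of 2, the toy corpus, the obfuscated token "idi0t", and the definition of a second-order neighborhood as "neighbors of first-order neighbors". Formula (1) is not reproduced here; the sketch only computes the Euclidean distance between two co-occurrence rows, which is the ingredient the text names.

```python
import math
from collections import Counter


def build_cooccurrence(texts, window=2):
    """Return (vocab, C), where C[i][j] counts how often word j occurs
    within `window` positions of word i."""
    vocab = sorted({w for t in texts for w in t})
    index = {w: i for i, w in enumerate(vocab)}
    C = [[0] * len(vocab) for _ in vocab]
    for t in texts:
        for i, w in enumerate(t):
            for j in range(max(0, i - window), min(len(t), i + window + 1)):
                if j != i:
                    C[index[w]][index[t[j]]] += 1
    return vocab, C


def sparse_words(texts, b):
    """Words whose corpus frequency is below the threshold b."""
    freq = Counter(w for t in texts for w in t)
    return sorted(w for w, n in freq.items() if n < b)


def first_order(word, vocab, C):
    """Words that co-occur with `word` at least once."""
    row = C[vocab.index(word)]
    return {vocab[j] for j, n in enumerate(row) if n > 0}


def second_order(word, vocab, C):
    """Assumed definition: neighbors of the first-order neighbors."""
    result = set()
    for neighbor in first_order(word, vocab, C):
        result |= first_order(neighbor, vocab, C)
    result.discard(word)
    return sorted(result)


# Hypothetical tokenized corpus; "idi0t" stands in for an obfuscated word.
texts = [
    ["you", "are", "an", "idiot"],
    ["you", "are", "an", "idi0t"],
    ["you", "are", "kind"],
]
vocab, C = build_cooccurrence(texts)
rare = sparse_words(texts, b=2)
dist = math.dist(C[vocab.index("idiot")], C[vocab.index("idi0t")])
```

In this toy corpus the obfuscated word and its original form have identical co-occurrence rows, so their Euclidean distance is zero even though each occurs only once.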
The similarity-based word embedding technique is built entirely on an autoencoder model. As shown in Fig. 2, the autoencoder comprises an encoder and a decoder. The encoder is a two-layer fully connected neural network whose input is the co-occurrence matrix C and whose output is the text representation matrix L. Unlike a conventional autoencoder, which reconstructs only the co-occurrence matrix, the decoder of the similarity-based word embedding technique must reconstruct both the co-occurrence matrix C and the similarity matrix S. The technique uses another two-layer fully connected neural network as its decoder, which generates the reconstructed co-occurrence matrix
Figure BDA0002381185560000048
and the reconstructed similarity matrix
Figure BDA0002381185560000049
according to formula (2):
Figure BDA00023811855600000410
The training process of the similarity-based word embedding technique can be expressed as:
Figure BDA00023811855600000411
In step 3), in the spoofing text detector module, the social information representation and the text information representation learned in step 1) and step 2) are fused, and a fusion vector is generated by concatenation. The dimension of the fusion vector is the maximum text length plus one (the extra row representing the user U_p). The first row of the fusion vector is the social information representation of the user who published the short text; each subsequent row is the representation of the word at the corresponding position in the sentence. If the sentence is shorter than the maximum sentence length, zero vectors are appended to the end of the fusion vector.
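A minimal sketch of this fusion step is shown below. It assumes, purely for illustration, that the user embedding and the word embeddings have the same width so they can be stacked as rows; the function name `fuse`, the example vectors, and the truncation guard for over-long inputs are hypothetical and not taken from the source.

```python
def fuse(user_vec, word_vecs, max_len):
    """Stack the user embedding on top of the word embeddings and pad
    with zero rows up to max_len + 1 rows in total."""
    dim = len(user_vec)
    if any(len(v) != dim for v in word_vecs):
        raise ValueError("user and word embeddings must share one width")
    # Row 0: social information of the publishing user.
    rows = [list(user_vec)]
    # Rows 1..l: word representations in sentence order
    # (truncation to max_len is a defensive assumption).
    rows.extend(list(v) for v in word_vecs[:max_len])
    # Remaining rows: zero vectors appended at the end.
    while len(rows) < max_len + 1:
        rows.append([0.0] * dim)
    return rows


# Hypothetical 2-dimensional embeddings for one user and a two-word text.
fused = fuse(
    user_vec=[0.1, 0.2],
    word_vecs=[[1.0, 1.0], [2.0, 2.0]],
    max_len=4,
)
```

With a maximum text length of 4, the result has 5 rows: one user row, two word rows, and two zero rows, which matches the "maximum text length plus one" dimension described above.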
The fusion vector is then input into the spoofing text classifier to identify whether the short text is a spoofing text. As shown in Fig. 4, the spoofing text classifier is based on a neural network structure. We use a bidirectional long short-term memory network (BLSTM) as the detector. Unlike a recurrent neural network (RNN), which can capture context in only one direction, a BLSTM captures context in both directions. In the classifier, the size of the Input Layer is max l + 1, and the Output Layer has k neurons representing the k text classes. The Softmax function is used as the activation function. The drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.
The effectiveness of the detection method is verified below on real data.
We used four real datasets to validate SocialBully; the dataset details are shown in Table 1. The datasets were generated by random sampling from a Twitter dataset. For the users appearing in the datasets, we obtained their follow relationships through the API provided by Twitter on September 10, 2018.
Each short text in the datasets is normalized through data preprocessing, which proceeds as follows: (1) removing special characters, including ! @ # $ % & ( ) - + = | ' [ ] { } ; ' , < > / ?; (2) removing web links; (3) removing stop words according to the stop-word list in the Natural Language Toolkit (NLTK); and (4) reducing the effect of word inflection through stemming.
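The four preprocessing steps can be sketched with the Python standard library alone. This is an approximation under clearly labeled assumptions: the tiny `STOP_WORDS` set is a stand-in for the NLTK stop-word list, `naive_stem` is a crude suffix stripper standing in for a real stemmer, lowercasing is added for convenience, and links are removed before special characters (the reverse of the listed order) so that URLs are still intact when they are matched.

```python
import re

# Assumption: a tiny illustrative stand-in for the NLTK stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Special characters listed in the preprocessing description.
SPECIAL_RE = re.compile(r"[!@#$%&()\-+=|'\[\]{};,<>/?]")


def naive_stem(word):
    """Crude suffix stripping; a real stemmer would replace this."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return the normalized token list for one short text."""
    text = URL_RE.sub(" ", text.lower())   # remove web links first
    text = SPECIAL_RE.sub(" ", text)       # remove special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


tokens = preprocess("Look!! http://t.co/abc the dogs are barking @you")
```

Removing links before punctuation is a design choice of this sketch: stripping "/" and similar characters first would break a URL into fragments that the link pattern could no longer match.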
Table 1: Information on the four Twitter datasets
Figure BDA0002381185560000051
We use TensorFlow and Keras to implement SimWord and the spoofing detector (Cyberbullying Detector), respectively. Categorical cross-entropy is used as the loss function of the detector. Each set of experiments was repeated 5 times, and the averaged results were taken as the final results. In each experiment, we randomly selected 80% of the data as the training dataset and used the rest as the test dataset. We chose precision, recall, and F1 score as the evaluation metrics. The predefined threshold b for sparse words is 2.
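The evaluation protocol can be illustrated with a short, dependency-free sketch. The helper names and the toy labels are hypothetical, and the metric computation assumes a single positive class; the source does not state how the metrics are averaged across classes, so that choice is an assumption.

```python
import random


def train_test_split(items, train_frac, rng):
    """Shuffle a copy of the data and split it into train and test parts."""
    items = list(items)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 80/20 split of ten hypothetical sample IDs.
train, test = train_test_split(range(10), train_frac=0.8, rng=random.Random(0))

# Metrics for one hypothetical run: 2 true positives, 1 false positive,
# 1 false negative.
precision, recall, f1 = precision_recall_f1(
    y_true=[1, 1, 1, 0, 0],
    y_pred=[1, 1, 0, 1, 0],
)
```

To mirror the protocol above, the split and evaluation would be repeated 5 times with different random seeds and the resulting metrics averaged.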
All experiments were performed on a personal computer running macOS Mojave with a 2.5 GHz Intel Core i7 processor and 16 GB of memory.
1.1 Comparison of the network spoofing detection method of the present invention with conventional network spoofing detection methods
We compare the network spoofing detection method of the present invention with two recent network spoofing detection methods: (1) the network spoofing detection framework proposed by Agrawal and Awekar, which uses deep-learning-based detectors and word embedding techniques to detect spoofing text but does not take social information into account; and (2) the network spoofing detection method proposed by Pitsilis et al. [32], which uses a long short-term memory network (LSTM) as the detector and combines it with the user's tendency toward racial and gender discrimination.
Table 2 shows the F1 values of the different spoofing detection methods on the four datasets, with bold font indicating the best detection result on each dataset. The experimental results show that SocialBully outperforms the other two detection methods in most cases. Compared with the method of Agrawal and Awekar, SocialBully achieves an F1 value greater than 0.9 on all four datasets, significantly higher than their method. In addition, we replaced the BLSTM detector with an LSTM and compared the result with the method of Pitsilis et al. As shown in Table 3, the network spoofing detection method of the present invention is superior to their method in many cases. This further demonstrates that fusing social information with the text information learned by SimWord (the similarity-based word embedding technique) effectively improves spoofing text detection.
Table 2: F1 values of different network spoofing detection methods on the four datasets
Figure BDA0002381185560000061
1.2 Impact analysis of social information on network spoofing detection
To compare the impact of different graph embedding techniques on spoofing detection, we selected three representative graph embedding techniques for the experiments: (1) Node2vec, which samples the neighborhood of the root node using random walks. In the experiments, the parameter q is set to 1.5 and 0.5, corresponding to a breadth-first sampling strategy and a depth-first sampling strategy, respectively; (2) GraRep, which learns latent node representations by matrix factorization and can capture the global structural information of the graph; and (3) DeepWalk.
Fig. 5 illustrates the impact of different graph embedding techniques on the network spoofing detection results. Detection with social information is shown in blue, and detection without social information is shown in yellow. Detection with social information outperforms detection without it on all three metrics, which demonstrates that adding social information improves network spoofing detection performance.
Among all the detection methods that add social information, the DeepWalk-based method performs best on the four datasets, followed in order by the Node2vec-based (q = 0.5), Node2vec-based (q = 1.5), and GraRep-based methods. Note that DeepWalk can be regarded as a special case of Node2vec (i.e., Node2vec with the parameters p and q both set to 1).
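The relationship between DeepWalk and Node2vec can be made concrete with a sketch of the Node2vec second-order transition bias on an unweighted graph: a candidate next node is weighted 1/p if it is the previous node, 1 if it is adjacent to the previous node, and 1/q otherwise. The toy graph and function names below are hypothetical and are not taken from the experiments.

```python
def node2vec_bias(prev, candidate, adj, p, q):
    """Unnormalized Node2vec weight for stepping to `candidate`
    given that the walk arrived from `prev`."""
    if candidate == prev:          # return to the previous node
        return 1.0 / p
    if candidate in adj[prev]:     # stays close to the previous node
        return 1.0
    return 1.0 / q                 # moves farther away


def transition_probs(prev, cur, adj, p, q):
    """Normalized probabilities over the neighbors of `cur`."""
    weights = {x: node2vec_bias(prev, x, adj, p, q) for x in adj[cur]}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


# Hypothetical undirected graph; the walk has just moved from "a" to "b".
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
uniform = transition_probs("a", "b", adj, p=1.0, q=1.0)   # DeepWalk-like
outward = transition_probs("a", "b", adj, p=1.0, q=0.5)   # favors "d"
inward = transition_probs("a", "b", adj, p=1.0, q=1.5)    # disfavors "d"
```

With p = q = 1 every neighbor gets the same probability, which is the uniform neighbor sampling used by DeepWalk. Lowering q pushes the walk outward toward "d" (depth-first behavior), and raising q keeps it near the previous node (breadth-first behavior).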
In addition, we further investigated the impact of social information on the efficiency of spoofing detection. Fig. 6 shows the F1 value and the execution time of spoofing detection using SocialBully on dataset 3. The execution time is approximately linear in the dimension of the fusion vector, because a larger fusion vector increases the number of parameters in the spoofing text detector, which requires more computation time. We also found that, with social information, SocialBully achieves better detection results at a lower dimension than without it. For example, without social information, the fusion vector dimension is 19 and the F1 value reaches 0.85 after 23.3 seconds of execution; with social information, the fusion vector dimension is reduced to 7 and the execution time is reduced by 8.6%. These results further verify that user social information has a positive impact on the efficiency of network spoofing detection.
1.3 Validation of SimWord (the similarity-based word embedding technique) in the spoofing detection task
We compared SimWord with two other representative word embedding techniques, Word2vec and GloVe, in terms of their effect on spoofing detection. The former trains quickly and is one of the most widely used word embedding techniques; the latter uses a global log-bilinear regression model for unsupervised word representation learning.
Fig. 7 compares the impact of different word embedding techniques on spoofing detection. SimWord is superior to the other methods on all three metrics, especially the F1 value, although the advantage is less evident on datasets 3 and 4. This is because these two datasets contain more spoofing samples, so the detector can capture more latent features and the text information representation plays a smaller role in detection. On the other two datasets, the latent features that the detector can capture are limited, so the text information representation has a more significant effect on the spoofing detection task. This also suggests that SimWord can alleviate the class imbalance problem to some extent.
1.4 Comparison of different spoofing text detectors in the spoofing detection task
We validated the effectiveness of the BLSTM-based detector by comparing it experimentally with two classes (six in total) of classical network spoofing detectors: (1) conventional machine-learning-based detectors, including Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB); and (2) deep-learning-based detectors, including the long short-term memory network (LSTM) and the bidirectional long short-term memory network combined with an attention mechanism (BLSTM_att). The LSTM can capture text information in only one direction; the attention mechanism in BLSTM_att can better retain dependencies between states at different time steps.
Table 3 compares the F1 values of the seven detectors on the four datasets, with bold font indicating the best result on each dataset. The experimental results show that deep-learning-based detectors perform better than conventional machine-learning-based detectors. With a deep-learning-based detector, almost all F1 values are above 0.9, whereas the best result of a conventional machine-learning-based detector is only 0.8748. On dataset 1 and dataset 4, the BLSTM-based detector adopted by SocialBully works best, whereas on the other two datasets the BLSTM_att-based detector performs slightly better than the BLSTM-based detector.
Table 3: F1 values of different detectors on the four datasets
Figure BDA0002381185560000081
The experimental results on the four real datasets show that the network spoofing detection method of the present invention achieves better detection performance than other network spoofing detection methods. The results also verify that social information effectively improves the accuracy of spoofing text detection.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment, and all technical solutions belonging to the principle of the present invention belong to the protection scope of the present invention. Modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention.

Claims (5)

1. A network spoofing detection method, comprising the following steps:
1) representing the social information of a user U_p according to a social network G(U, R);
2) accurately representing each word in W_h, in particular sparse words;
3) fusing the social information representation and the text information representation obtained in step 1) and step 2), and on that basis assigning S_h the correct text label;
wherein the social network is represented as G(U, R), the node set U is the set of users, and the edge set R is the set of follow relationships between users; the set of unlabeled short texts published by all users in G is denoted S; let
Figure FDA0002381185550000011
denote the set of label categories, where k is the number of label categories, and each text in S is assigned exactly one category label; let
Figure FDA0002381185550000012
denote the set of words in a text S_h ∈ S published by user U_p, where l is the length of the short text S_h; each short text in S belongs to exactly one user.
2. The network spoofing detection method according to claim 1, wherein, in step 1):
a group of node sequences is obtained from the social network G by random walk, the sequences carrying the social information of the users; in the random walk, a node U_i is sampled at random and taken as the root node, then a neighbor node U_j of U_i is sampled at random, U_j is taken as the new root node, and the process is repeated until a preset threshold on the number of sampling steps is reached;
after the random walk is complete, the social information representation of each user node is learned using the Skip-Gram algorithm.
3. The method of claim 1, wherein the network spoofing detection comprises: in the step 2), the step (c) is carried out,
for a given corpus, a word embedding method based on word co-occurrence vector similarity sequentially generates: (1) a co-occurrence matrix
Figure FDA0002381185550000013
Where d is the size of the corpus; (2) sparse vocabulary list in corpus
Figure FDA0002381185550000014
Taking words with the word frequency lower than a predefined threshold value b as sparse words; (3) second order neighborhood list for each sparse word
Figure FDA0002381185550000015
wherein
Figure FDA0002381185550000016
Is the sparse word OiThe length of the second-order neighborhood list, and then obtaining a similarity matrix according to the co-occurrence matrix C
Figure FDA0002381185550000017
The calculation formula is shown as formula (1);
Figure FDA0002381185550000018
wherein ,
Figure FDA0002381185550000019
representing a standard Euclidean distance; f max fri;
the word embedding method based on the similarity of the word co-occurrence vectors is based on a self-encoder model, the self-encoder comprises an encoder and a decoder, the word embedding method based on the similarity of the word co-occurrence vectors adopts a two-layer fully-connected neural network as the encoder, the input of the encoder is a co-occurrence matrix C, the output of the encoder is a text representation matrix L, the decoder of the word embedding method based on the similarity of the word co-occurrence vectors needs to reconstruct the co-occurrence matrix C and a similarity matrix S, the word embedding method based on the similarity of the word co-occurrence vectors uses the other two-layer fully-connected neural network as the decoder, and the decoder needs to generate a reconstructed co-occurrence matrix
Figure FDA00023811855500000110
and a reconstructed similarity matrix, generated according to formula (2)
Figure FDA00023811855500000111
Figure FDA00023811855500000112
The training process of the word embedding method based on word co-occurrence vector similarity is expressed as follows:
Figure FDA0002381185550000021
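The encoder and decoder structure described in this claim can be sketched as a forward pass. This is a hedged illustration only: the layer sizes, the tanh activation, the weight initialization, and all function names are assumptions, and plain Python is used to keep the example dependency-free. Formula (2) and the training objective survive only as image references, so the sketch covers the reconstruction of the co-occurrence matrix C and omits both the similarity-matrix term and the parameter updates.

```python
import math
import random


def dense(x, weights, bias, activation):
    """Apply one fully connected layer to a single input vector x."""
    out = []
    for col in range(len(bias)):
        z = bias[col] + sum(x[row] * weights[row][col] for row in range(len(x)))
        out.append(activation(z))
    return out


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def identity(z):
    return z


def build_autoencoder(d, hidden, embed, seed=0):
    """Two-layer encoder (d -> hidden -> embed) and two-layer decoder (embed -> hidden -> d)."""
    rng = random.Random(seed)
    encoder = [init_layer(d, hidden, rng), init_layer(hidden, embed, rng)]
    decoder = [init_layer(embed, hidden, rng), init_layer(hidden, d, rng)]
    return encoder, decoder


def encode(row, encoder):
    """Map one row of C to its row of the text representation matrix L."""
    h = dense(row, *encoder[0], math.tanh)
    return dense(h, *encoder[1], math.tanh)


def decode(code, decoder):
    """Map one row of L back to a reconstructed row of C."""
    h = dense(code, *decoder[0], math.tanh)
    return dense(h, *decoder[1], identity)


def reconstruction_error(C, encoder, decoder):
    """Mean squared error between C and its reconstruction (assumed loss term)."""
    total = 0.0
    for row in C:
        rebuilt = decode(encode(row, encoder), decoder)
        total += sum((a - b) ** 2 for a, b in zip(row, rebuilt))
    return total / (len(C) * len(C[0]))


# Toy 3 x 3 co-occurrence matrix (illustrative data only).
C = [[0, 2, 1], [2, 0, 0], [1, 0, 0]]
encoder, decoder = build_autoencoder(d=3, hidden=4, embed=2)
L = [encode(row, encoder) for row in C]
loss = reconstruction_error(C, encoder, decoder)
```

After training, which is not shown, the rows of L would serve as the word representations passed on to the fusion step.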
4. The network spoofing detection method of claim 1, wherein, in step 3),
the social information representation and the text information representation learned in steps 1) and 2) are fused by concatenation to generate a fusion vector whose dimensionality is the maximum text length plus one (the additional one representing the user U_p); the first row of the fusion vector is the social information representation of the user who posted the short text, and each subsequent row is the representation of the corresponding word of the sentence, in order; if the sentence is shorter than the maximum sentence length, zero vectors are appended to the end of the fusion vector;
the fusion vector is input into the deception text classifier to identify whether the short text constitutes network spoofing.
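The concatenation step can be sketched as below. The function name `fuse` is hypothetical, and two details are assumptions not stated in the claim: the user representation and the word representations are taken to share the same width, and a text longer than the maximum length is truncated.

```python
def fuse(user_vec, word_vecs, max_len):
    """Stack the user's social vector on top of the word vectors and zero-pad.

    Returns max_len + 1 rows: row 0 is the user representation, rows 1..n are
    the words in sentence order, and any remaining rows are zero vectors.
    """
    dim = len(user_vec)
    if any(len(v) != dim for v in word_vecs):
        raise ValueError("user and word vectors must share one width (assumption)")
    rows = [list(user_vec)] + [list(v) for v in word_vecs[:max_len]]
    while len(rows) < max_len + 1:
        rows.append([0.0] * dim)  # pad short sentences with zero vectors
    return rows


# A two-word text posted by one user, padded to a maximum length of 4.
fused = fuse([0.1, 0.2], [[1.0, 1.0], [2.0, 2.0]], max_len=4)
```

Fixing the number of rows at the maximum text length plus one gives every short text the same input shape, which matches the Input Layer size stated for the classifier.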
5. The network spoofing detection method of claim 4, wherein the deception text classifier is based on a neural network structure and uses a bidirectional long short-term memory network as the detector; the size of the Input Layer in the classifier is max l + 1; the Output Layer has k neurons representing k text classes and uses the Softmax function as the activation function; and the dropout rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.
CN202010083486.5A 2020-02-09 2020-02-09 Network spoofing detection method Active CN111274403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010083486.5A CN111274403B (en) 2020-02-09 2020-02-09 Network spoofing detection method

Publications (2)

Publication Number Publication Date
CN111274403A true CN111274403A (en) 2020-06-12
CN111274403B CN111274403B (en) 2023-04-25

Family

ID=71003558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010083486.5A Active CN111274403B (en) 2020-02-09 2020-02-09 Network spoofing detection method

Country Status (1)

Country Link
CN (1) CN111274403B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814454A (en) * 2020-07-10 2020-10-23 重庆大学 Multi-modal network spoofing detection model on social network
CN114036366A (en) * 2021-11-19 2022-02-11 四川大学 Social network deception detection method based on text semantics and hierarchical structure

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064304A1 (en) * 2002-07-03 2004-04-01 Word Data Corp Text representation and method
US20130138428A1 (en) * 2010-01-07 2013-05-30 The Trustees Of The Stevens Institute Of Technology Systems and methods for automatically detecting deception in human communications expressed in digital form
US20190014071A1 (en) * 2016-10-13 2019-01-10 Tencent Technology (Shenzhen) Company Limited Network information identification method and apparatus
GB2572320A (en) * 2018-03-12 2019-10-02 Factmata Ltd Hate speech detection system for online media content
CN110704715A (en) * 2019-10-18 2020-01-17 南京航空航天大学 Cyberbullying detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NEKTARIA POTHA et al.: "Cyberbullying Detection using Time Series Modeling", 《2014 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOP》 *
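ZHONG Lijun: "Research on Identifying Users Who Post Harmful Speech on Social Networks", 《China Master's Theses Full-text Database (Information Science and Technology)》 *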

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814454A (en) * 2020-07-10 2020-10-23 重庆大学 Multi-modal network spoofing detection model on social network
CN111814454B (en) * 2020-07-10 2023-08-11 重庆大学 Multi-mode network spoofing detection model on social network
CN114036366A (en) * 2021-11-19 2022-02-11 四川大学 Social network deception detection method based on text semantics and hierarchical structure

Also Published As

Publication number Publication date
CN111274403B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
US11669689B2 (en) Natural language generation using pinned text and multiple discriminators
Ren et al. A sentiment-aware deep learning approach for personality detection from text
US10437936B2 (en) Generative text using a personality model
Agarwal et al. Fake news detection: an ensemble learning approach
Shahi et al. A hybrid feature extraction method for Nepali COVID-19-related tweets classification
Daumé III et al. A large-scale exploration of effective global features for a joint entity detection and tracking model
Zheng et al. The email author identification system based on Support Vector Machine (SVM) and Analytic Hierarchy Process (AHP)
CN112214614B (en) Knowledge-graph-based risk propagation path mining method and system
CN112612871B (en) Multi-event detection method based on sequence generation model
Brocardo et al. Verifying online user identity using stylometric analysis for short messages
CN111274403B (en) Network spoofing detection method
Islam et al. Deep Learning for Multi-Labeled Cyberbully Detection: Enhancing Online Safety
CN113849597A (en) Illegal advertising word detection method based on named entity recognition
CN116049419A (en) Threat information extraction method and system integrating multiple models
Prachi et al. Detection of Fake News Using Machine Learning and Natural Language Processing Algorithms [J]
Ullah et al. A deep neural network-based approach for sentiment analysis of movie reviews
Hofmann et al. A graph auto-encoder model of derivational morphology
Meng et al. Predicting hate intensity of twitter conversation threads
Gôlo et al. One-class learning for fake news detection through multimodal variational autoencoders
Mahajan et al. EnsMulHateCyb: Multilingual hate speech and cyberbully detection in online social media
Zhao et al. Topic identification of text‐based expert stock comments using multi‐level information fusion
Sindhuja et al. Twitter Sentiment Analysis using Enhanced TF-DIF Naive Bayes Classifier Approach
Mathur et al. Analysis of Tweets for Cyberbullying Detection
Nisha et al. Detection and classification of cyberbullying in social media using text mining
Zhang et al. Sentiment identification by incorporating syntax, semantics and context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant