CN111274403A - Network spoofing detection method - Google Patents
Network spoofing detection method
- Publication number
- CN111274403A
- Authority
- CN
- China
- Prior art keywords
- text
- word
- network
- user
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a network spoofing detection method comprising the following steps: 1) representing the social information of a user U_p according to a social network G(U, R); 2) accurately representing each word in W_h, in particular sparse words; 3) fusing the social information representation and the text information representation obtained in steps 1) and 2), and assigning the correct text label to S_h on that basis. The social network is represented as G(U, R), where the node set U is the set of users and the edge set R is the set of follow relationships among users. The set of unlabeled short texts published by all users in G is denoted S. A label category set with k label categories is defined, and each text in S is assigned exactly one category label. W_h denotes the set of words in the short text S_h ∈ S published by user U_p, where l is the length of S_h. Each short text in S belongs to exactly one user. The method achieves better detection performance and effectively improves the accuracy of deceptive text detection.
Description
Technical Field
The invention relates to the technical field of network security, in particular to a network spoofing detection method.
Background
Network spoofing detection algorithms in the prior art generally fall into two categories: text-based detection methods and multimodal detection methods. Text-based methods use only text information and are the most common. Most recent text-based methods use word embedding for text representation learning; the main advantage is that low-dimensional word representations greatly improve detection efficiency. However, word embedding still has limitations when dealing with "intentionally obfuscated words" in deceptive text. These are words that the publisher of deceptive content deliberately distorts in order to evade system detection. Such purposefully created vocabulary is very sparse in the corpus. Conventional word embedding techniques typically either remove sparse words during preprocessing or represent them by random vectors. Because these sparse words often play a crucial role in the detection task, conventional word embedding techniques do not work well for network spoofing detection.
In addition, detecting a network spoofing event sometimes requires context information rather than analysis of the text alone. For example, "you should wear a skirt" can be regarded as a normal expression when sent to a girl, but is likely to be a deceptive text when sent to a boy. To address this problem, researchers have proposed multimodal detection methods, which incorporate additional information about the text to be detected (e.g., the gender, age, and educational status of the user) to improve detection performance. However, such additional information is usually private user data and is difficult to obtain. Many multimodal detection methods therefore use meta-information (e.g., the picture attached to a tweet) as the additional information. Unlike private user data, meta-information can be obtained easily through the API provided by the social platform. However, some studies have shown that most meta-information is not effective for spoofing detection.
Disclosure of Invention
The present invention aims to solve the above problems by providing a more effective network spoofing detection method.
To achieve this purpose, the invention adopts the following technical scheme: a network spoofing detection method comprising the following steps:
1) representing the social information of a user U_p according to a social network G(U, R);
2) accurately representing each word in W_h, in particular sparse words;
3) fusing the social information representation and the text information representation obtained in steps 1) and 2), and assigning the correct text label to S_h on that basis;
wherein the social network is represented as G(U, R), the node set U is the set of users, and the edge set R is the set of follow relationships among users; the set of unlabeled short texts published by all users in G is denoted S; a label category set with k label categories is defined, and each text in S is assigned exactly one category label; W_h denotes the set of words in the short text S_h ∈ S published by user U_p, where l is the length of S_h; and each short text in S belongs to exactly one user.
Further, in step 1), a group of node sequences containing the social information of users is obtained from the social network G by random walk. In the random walk, a node U_i is randomly sampled and used as the root node; a neighbor node U_j of U_i is then randomly sampled and U_j becomes the new root node; the process is repeated until a preset threshold on the number of samples is reached. After the random walk is completed, the social information representation of each user node is learned using the Skip-Gram algorithm.
Further, in step 2), for a given corpus, the word embedding method based on word co-occurrence vector similarity generates, in order: (1) a co-occurrence matrix C, where d is the size of the corpus; (2) a list of the sparse words in the corpus, where words with a frequency below a predefined threshold b are treated as sparse words; (3) a second-order neighborhood list for each sparse word O_i, together with the length of that list. A similarity matrix S is then obtained from the co-occurrence matrix C according to formula (1);
the word embedding method based on word co-occurrence vector similarity is based on an autoencoder model comprising an encoder and a decoder. A two-layer fully connected neural network serves as the encoder; its input is the co-occurrence matrix C and its output is the text representation matrix L. The decoder, another two-layer fully connected neural network, must reconstruct both the co-occurrence matrix C and the similarity matrix S: it generates a reconstructed co-occurrence matrix and generates a reconstructed similarity matrix according to formula (2).
The training process of the word embedding method based on word co-occurrence vector similarity is expressed as follows:
Further, in step 3), the social information representation and the text information representation learned in steps 1) and 2) are fused, and a fusion vector is generated by concatenation. The dimension of the fusion vector is the maximum text length plus one (the extra row representing the user U_p). The first row of the fusion vector is the social information representation of the user who published the short text; each subsequent row is the representation of the corresponding word in the sentence, in order. If the sentence is shorter than the maximum sentence length, zero vectors are appended to the end of the fusion vector. The fusion vector is input into a deceptive text classifier to identify whether the short text is a deceptive text.
Further, the deceptive text classifier is based on a neural network structure and uses a bidirectional long short-term memory network as the detector. The size of the Input Layer in the classifier is max l + 1; the Output Layer has k neurons representing the k text classes and uses the Softmax function as the activation function; the drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.
Compared with the prior art, the invention has the following beneficial effects: the method has better detection performance, and the social information effectively improves the accuracy of deceptive text detection.
Drawings
Fig. 1 is an example of intentionally obfuscated words.
Fig. 2 is an architecture diagram of the similarity-based word embedding technique.
Fig. 3 is a general architecture diagram of the network spoofing detection method.
Fig. 4 is an example of the deceptive text classifier architecture.
Fig. 5 is a diagram illustrating the effect of different graph embedding methods on the spoofing detection results.
Fig. 6 is a graph of the impact of social information and fusion vector dimension on detection efficiency.
Fig. 7 is a diagram illustrating the impact of different word embedding techniques on spoofing detection.
Detailed Description
Referring to Fig. 3, the network spoofing detection task in this method is a text classification task. The social network is represented as G(U, R), where the node set U is the set of users and the edge set R is the set of follow relationships between users. The set of unlabeled short texts published by all users in G is denoted S. A label category set with k label categories is defined. Each text in S is assigned one and only one category label. W_h denotes the set of words in the short text S_h ∈ S published by user U_p, where l is the length of S_h. Each short text in S belongs to exactly one user.
From the above, the network spoofing detection task can be stated as identifying whether S_h is a deceptive text. The method is divided into three steps: 1) representing the social information of user U_p according to the social network G(U, R); 2) accurately representing each word in W_h, in particular sparse words; 3) fusing the social information representation and the text information representation obtained in steps 1) and 2), and assigning the correct text label to S_h on that basis, so as to identify whether it is a deceptive text.
In step 1), a group of node sequences containing the social information of users is first obtained from the social network G by random walk. In the random walk, a node U_i is randomly sampled and used as the root node, and a neighbor node U_j of U_i is then randomly sampled. U_j then becomes the root node, and the process is repeated until a preset threshold on the number of samples is reached. After the random walk is completed, the social information representation of each user (node) is learned using the Skip-Gram algorithm.
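The random walk in step 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `random_walks`, its parameters, and the toy follow graph are hypothetical, neighbors are sampled uniformly, and the subsequent Skip-Gram training on the resulting sequences is not shown.

```python
import random


def random_walks(adjacency, num_walks, walk_length, seed=0):
    """Generate node sequences by uniform random walks over a follow graph."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(num_walks):
        root = rng.choice(nodes)              # randomly sample a root node U_i
        walk = [root]
        while len(walk) < walk_length:        # preset sampling threshold
            neighbours = adjacency.get(walk[-1], [])
            if not neighbours:                # dead end: nothing left to sample
                break
            # the sampled neighbour U_j becomes the new root node
            walk.append(rng.choice(neighbours))
        walks.append(walk)
    return walks


# Hypothetical toy follow graph, treated as undirected for simplicity.
graph = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u3"],
    "u3": ["u1", "u2", "u4"],
    "u4": ["u3"],
}
walks = random_walks(graph, num_walks=5, walk_length=4)
```

Each walk plays the role of a "sentence" whose "words" are user nodes, which is what allows a Skip-Gram model to learn one vector per user.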
In step 2), as shown in Fig. 1, "fcukk" and "fxxk" are intentionally obfuscated words. They are sparse and carry little context information, and conventional processing treats them as two different words. In fact, they express the same meaning and their first-order neighborhoods are highly similar, namely { fcukk: you, locking, shit, killing } and { fxxk: you, locking, shifting }. To represent intentionally obfuscated words with the same semantics effectively, the application designs a similarity-based word embedding technique, so that an intentionally obfuscated word and its corresponding original word have similar representations.
For a given corpus, the similarity-based word embedding technique generates, in order: (1) a co-occurrence matrix C, where d is the size of the corpus; (2) a list of the sparse words in the corpus, where words with a frequency below a predefined threshold b are treated as sparse words; and (3) a second-order neighborhood list for each sparse word O_i, together with the length of that list. A similarity matrix S is then obtained from the co-occurrence matrix C according to formula (1).
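The three structures listed above can be sketched on a toy corpus. This is only an illustration under assumptions: the context window size, the use of vocabulary size as the matrix dimension, and the reading of "second-order neighborhood" as "neighbors of first-order neighbors" are not stated in the text, and all function names are hypothetical. The similarity computation itself is left out because it depends on formula (1).

```python
from collections import Counter


def build_cooccurrence(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    d = len(vocab)  # assumption: the matrix is d x d over the vocabulary
    C = [[0] * d for _ in range(d)]
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[index[w]][index[s[j]]] += 1
    return vocab, index, C


def sparse_words(sentences, b):
    """Words whose frequency is below the predefined threshold b."""
    freq = Counter(w for s in sentences for w in s)
    return sorted(w for w, n in freq.items() if n < b)


def first_order(word, vocab, index, C):
    """Words that co-occur with `word` at least once."""
    return {vocab[j] for j, n in enumerate(C[index[word]]) if n > 0}


def second_order(word, vocab, index, C):
    """Assumed definition: neighbours of the first-order neighbours."""
    result = set()
    for neighbour in first_order(word, vocab, index, C):
        result |= first_order(neighbour, vocab, index, C)
    result.discard(word)
    return sorted(result)


# Toy corpus reusing the two obfuscated tokens from Fig. 1.
corpus = [
    ["you", "are", "fcukk", "stupid"],
    ["you", "are", "fxxk", "stupid"],
    ["you", "are", "nice"],
]
vocab, index, C = build_cooccurrence(corpus)
rare = sparse_words(corpus, b=2)
```

In this toy corpus the two obfuscated tokens have identical first-order neighborhoods and each appears in the other's second-order neighborhood, which is the kind of signal a similarity matrix over co-occurrence vectors can exploit.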
The similarity-based word embedding technique is based entirely on an autoencoder model. As shown in Fig. 2, the autoencoder includes an encoder and a decoder. A two-layer fully connected neural network serves as the encoder; its input is the co-occurrence matrix C and its output is the text representation matrix L. Unlike a conventional autoencoder, which reconstructs only the co-occurrence matrix, the decoder of the similarity-based word embedding technique must reconstruct both the co-occurrence matrix C and the similarity matrix S. Another two-layer fully connected neural network serves as the decoder; it generates a reconstructed co-occurrence matrix and generates a reconstructed similarity matrix according to formula (2).
The training process of the similarity-based word embedding technique can be expressed as:
In step 3), the deceptive text detector module fuses the social information representation and the text information representation learned in steps 1) and 2) and generates a fusion vector by concatenation. The dimension of the fusion vector is the maximum text length plus one (the extra row representing the user U_p). The first row of the fusion vector is the social information representation of the user who published the short text; each subsequent row is the representation of the corresponding word in the sentence, in order. If the sentence is shorter than the maximum sentence length, zero vectors are appended to the end of the fusion vector.
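The fusion described in step 3) amounts to stacking one user row on top of the word rows and padding with zero rows. The sketch below illustrates this with plain Python lists; the function name `fuse` is hypothetical, and it assumes that the social vector and the word vectors share the same dimension so that they can be stacked.

```python
def fuse(social_vec, word_vecs, max_len):
    """Stack the user's social vector on top of the word vectors, zero-padded.

    Returns max_len + 1 rows: row 0 is the user, rows 1.. are the words.
    """
    dim = len(social_vec)
    if any(len(v) != dim for v in word_vecs):
        # assumption: both kinds of vectors share one dimension
        raise ValueError("word and social vectors must share one dimension")
    if len(word_vecs) > max_len:
        raise ValueError("text is longer than max_len")
    rows = [list(social_vec)] + [list(v) for v in word_vecs]
    # append zero vectors when the sentence is shorter than max_len
    rows += [[0.0] * dim for _ in range(max_len - len(word_vecs))]
    return rows


fused = fuse([0.1, 0.2], [[0.3, 0.4], [0.5, 0.6]], max_len=3)
```

With a maximum text length of 3, the result has 4 rows, matching the "maximum text length plus one" size given in the text.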
The fusion vector is then input into a deceptive text classifier to identify whether the short text is a deceptive text. As shown in Fig. 4, the deceptive text classifier is based on a neural network structure. We use a bidirectional long short-term memory network (BLSTM) as the detector. Unlike a recurrent neural network (RNN) that captures context in one direction only, a BLSTM captures context in both directions. In the classifier, the size of the Input Layer is max l + 1, and the Output Layer has k neurons representing the k text classes. The Softmax function is used as the activation function. The drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.
The effectiveness of the detection method is verified below using real data.
We used four real datasets to validate SocialBully, the proposed detection method; the dataset information is shown in Table 1. These datasets were generated by random sampling from a Twitter dataset. For the users who appear in the datasets, we obtained their follow relationships through the API provided by Twitter on September 10, 2018.
Each short text in the datasets is normalized by data preprocessing, as follows: (1) remove special characters, including ! @ # $ % & ( ) - + = | ' [ ] { } ; , < > / ?; (2) remove web links; (3) remove stop words according to the stop word list in the Natural Language Toolkit (NLTK); and (4) reduce the influence of word inflection through stemming.
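The four preprocessing steps can be sketched as a small pipeline. This is a simplified illustration, not the code used in the experiments: the stop word set and the suffix-stripping stemmer below are hypothetical stand-ins for the NLTK stop word list and a real stemmer, lowercasing is an added assumption, and links are removed before special characters so that stripping "/" and "?" does not leave URL fragments behind.

```python
import re

# Hypothetical stand-in for the NLTK English stop word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}
SPECIAL = re.compile(r"[!@#$%&()\-+=|'\[\]{};,<>/?]")
URL = re.compile(r"https?://\S+")


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = URL.sub(" ", text.lower())      # step (2): remove web links
    text = SPECIAL.sub(" ", text)          # step (1): remove special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # step (3)
    return [naive_stem(t) for t in tokens]                     # step (4)


tokens = preprocess("The dogs are walking to https://example.com/a?b=1 #now!")
```

In practice the stand-ins would be replaced by the NLTK resources named in the text; the order and intent of the steps stay the same.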
Table 1: Information on the four Twitter datasets
We use TensorFlow and Keras to implement SimWord and the deceptive text detector (cyberbullying detector), respectively. Categorical cross-entropy is used as the loss function of the detector. Each set of experiments was repeated 5 times, and the average was taken as the final result. In each experiment, we randomly selected 80% of the data as the training dataset and used the rest as the test dataset. We chose precision, recall, and F1 score as evaluation metrics. The predefined threshold b for sparse words is 2.
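The evaluation protocol described above (a random 80/20 split, then precision, recall, and F1) can be sketched in a few lines. The helper names are hypothetical, and the example assumes a binary labelling in which 1 marks a deceptive text; with k classes the same computation would be applied per class.

```python
import random


def split_80_20(samples, seed=0):
    """Randomly select 80% of the samples for training, the rest for testing."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


train, test = split_80_20(range(10))
p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In the toy call there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3.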
All experiments were performed on a personal computer running macOS Mojave with a 2.5 GHz Intel Core i7 processor and 16 GB of memory.
1.1 Comparison of the network spoofing detection method of the present invention with existing network spoofing detection methods
We compare the network spoofing detection method of the present invention with two recent network spoofing detection methods: (1) the framework proposed by Agrawal and Awekar, which uses deep learning-based detectors and word embedding techniques to detect deceptive text but does not take social information into account; (2) the method proposed by Pitsilis et al. [32], which uses a long short-term memory network (LSTM) as the detector and combines it with the user's tendency toward racial or gender discrimination.
Table 2 shows the F1 values of the different detection methods on the four datasets, with bold indicating the best result on each dataset. The experimental results show that SocialBully outperforms the other two methods in most cases. Its F1 value exceeds 0.9 on all four datasets, significantly higher than the method of Agrawal and Awekar. We also replaced the BLSTM detector with an LSTM and compared the result with the method of Pitsilis et al. As shown in Table 3, the network spoofing detection method of the present invention is superior to their method in many cases. This further demonstrates that fusing social information with the text information learned by SimWord (the similarity-based word embedding technique) effectively improves deceptive text detection.
Table 2: F1 values of different network spoofing detection methods on the four datasets
1.2 Impact analysis of social information on network spoofing detection
To compare the impact of different graph embedding techniques on spoofing detection, we selected three representative techniques for the experiments: (1) Node2vec, which samples the neighborhood of the root node using random walks; in the experiments the parameter q is set to 1.5 and 0.5, corresponding to a breadth-first and a depth-first sampling strategy, respectively; (2) GraRep, which learns latent node representations by matrix factorization and can capture the global structural information of the graph; and (3) DeepWalk.
Fig. 5 illustrates the impact of different graph embedding techniques on the network spoofing detection result. The detection method with the added social information is represented as blue, and the detection method without the added social information is represented as yellow. It can be found that the detection method with the added social information is superior to the detection method without the added social information in three indexes. This demonstrates that adding social information is beneficial for improving network spoofing detection performance.
Among all the detection methods that add social information, the DeepWalk-based method performs best on the four datasets, followed in order by the Node2vec-based (q = 0.5), Node2vec-based (q = 1.5), and GraRep-based methods. Note that DeepWalk can be regarded as a special case of Node2vec (i.e., Node2vec with both parameters p and q set to 1).
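The relation between DeepWalk and Node2vec noted above can be made concrete with the unnormalized transition weights of a Node2vec walk: 1/p for returning to the previous node, 1 for moving to a node adjacent to the previous node, and 1/q for moving further away. The sketch below is illustrative only; the function name and the toy graph are hypothetical.

```python
def node2vec_weight(prev, candidate, adjacency, p, q):
    """Unnormalised weight for stepping to `candidate`, having arrived from `prev`.

    `candidate` is assumed to be a neighbour of the current node.
    """
    if candidate == prev:                 # step back to the previous node
        return 1.0 / p
    if candidate in adjacency[prev]:      # stay close to the previous node
        return 1.0
    return 1.0 / q                        # move away from the previous node


# Hypothetical toy graph; the walk arrived at u3 from u1.
toy_graph = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u3"],
    "u3": ["u1", "u2", "u4"],
    "u4": ["u3"],
}
uniform = [node2vec_weight("u1", c, toy_graph, p=1, q=1) for c in toy_graph["u3"]]
outward = [node2vec_weight("u1", c, toy_graph, p=1, q=0.5) for c in toy_graph["u3"]]
```

With p = q = 1 every neighbor receives the same weight, which is the uniform walk used by DeepWalk; with q = 0.5 the node farther from the previous node (u4) is favored, giving the depth-first behavior, while q = 1.5 would penalize it and keep the walk local.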
In addition, we further investigated the impact of social information on the efficiency of spoofing detection. Fig. 6 shows the F1 value and the execution time of spoofing detection using SocialBully on dataset 3. There is an approximately linear relationship between the execution time and the dimension of the fusion vector. This is because a larger fusion vector dimension increases the number of parameters in the deceptive text detector, which requires more computation time. We also found that, with social information, SocialBully achieves better detection results at a lower dimension than without it. For example, without social information the fusion vector dimension is 19 and the F1 value reaches 0.85 after 23.3 seconds of running; with social information the fusion vector dimension is reduced to 7 and the running time is reduced by 8.6%. These results further verify that user social information has a positive impact on the efficiency of network spoofing detection.
1.3 Validation of SimWord (the similarity-based word embedding technique) in the spoofing detection task
We compared the effect of SimWord and two other representative word embedding techniques, Word2vec and GloVe, on spoofing detection. The former trains quickly and is one of the most common word embedding techniques at present; the latter uses a global log-bilinear regression model for unsupervised word representation learning.
Fig. 7 compares the impact of the different word embedding techniques on spoofing detection. SimWord is superior to the other methods on all three metrics, especially the F1 value, although the advantage is not evident on datasets 3 and 4. These two datasets contain more deceptive samples, so the detector can capture more latent features and the text information representation plays a smaller role in detection. In the other two datasets, the latent features that the detector can capture are limited, so the text information representation has a more significant effect on the detection task. This also suggests that SimWord may alleviate the class imbalance problem to some extent.
1.4 Comparison of different deceptive text detectors in the spoofing detection task
We validated the effectiveness of the BLSTM-based detector by comparing it experimentally with two classes (six detectors) of classical network spoofing detectors: (1) conventional machine learning-based detectors, including Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Naive Bayes (NB); and (2) deep learning-based detectors, including the long short-term memory network (LSTM) and the bidirectional long short-term memory network combined with an attention mechanism (BLSTM_att). The LSTM can capture text information in one direction only; the attention mechanism in BLSTM_att can better retain dependencies between states at different time steps.
Table 3 compares the F1 values of the seven detectors on the four datasets, with bold indicating the best result on each dataset. The experimental results show that the deep learning-based detectors perform better than the conventional machine learning-based detectors. With a deep learning-based detector almost all F1 values were above 0.9, whereas the best result of a conventional machine learning-based detector was only 0.8748. On datasets 1 and 4, the BLSTM-based detector adopted by SocialBully works best, whereas on the other two datasets the BLSTM_att-based detector performed slightly better than the BLSTM-based detector.
Table 3: F1 values of the different detectors on the four datasets
The experimental results on the four real datasets show that the network spoofing detection method of the invention has better detection performance than other network spoofing detection methods. The results also verify that social information can effectively improve the accuracy of deceptive text detection.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment, and all technical solutions belonging to the principle of the present invention belong to the protection scope of the present invention. Modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention.
Claims (5)
1. A network spoofing detection method, comprising the following steps:
1) representing the social information of a user U_p according to a social network G(U, R);
2) accurately representing each word in W_h, in particular sparse words;
3) fusing the social information representation and the text information representation obtained in steps 1) and 2), and assigning the correct text label to S_h on that basis;
wherein the social network is represented as G(U, R), the node set U is the set of users, and the edge set R is the set of follow relationships among users; the set of unlabeled short texts published by all users in G is denoted S; a label category set with k label categories is defined, and each text in S is assigned exactly one category label; W_h denotes the set of words in the short text S_h ∈ S published by user U_p, where l is the length of S_h; and each short text in S belongs to exactly one user.
2. The network spoofing detection method of claim 1, wherein in step 1):
a group of node sequences containing the social information of users is obtained from the social network G by random walk; in the random walk, a node U_i is randomly sampled and used as the root node, a neighbor node U_j of U_i is then randomly sampled and U_j becomes the new root node, and the process is repeated until a preset threshold on the number of samples is reached;
after the random walk is completed, the social information representation of each user node is learned using the Skip-Gram algorithm.
3. The network spoofing detection method of claim 1, wherein in step 2):
for a given corpus, a word embedding method based on word co-occurrence vector similarity generates, in order: (1) a co-occurrence matrix C, where d is the size of the corpus; (2) a list of the sparse words in the corpus, where words with a frequency below a predefined threshold b are treated as sparse words; (3) a second-order neighborhood list for each sparse word O_i, together with the length of that list; a similarity matrix S is then obtained from the co-occurrence matrix C according to formula (1);
the word embedding method based on word co-occurrence vector similarity is based on an autoencoder model comprising an encoder and a decoder; a two-layer fully connected neural network serves as the encoder, whose input is the co-occurrence matrix C and whose output is the text representation matrix L; the decoder, another two-layer fully connected neural network, must reconstruct both the co-occurrence matrix C and the similarity matrix S, generating a reconstructed co-occurrence matrix and generating a reconstructed similarity matrix according to formula (2).
The training process of the word embedding method based on word co-occurrence vector similarity is expressed as follows:
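As a rough illustration of the first two artifacts named in claim 3, the Python sketch below builds a co-occurrence matrix C and a sparse-word list from a toy corpus. Several things here are assumptions rather than the patent's definitions: the symmetric context `window`, the toy `corpus`, the helper names, and above all the cosine similarity, which is used only as a stand-in because formula (1) is not reproduced in this text. The second-order neighborhood list, the autoencoder, and the training objective are left out for the same reason.

```python
from collections import Counter
from math import sqrt

def cooccurrence(sentences, window=2):
    """Return the vocabulary and a dense co-occurrence matrix C (list of lists)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    C = [[0] * len(vocab) for _ in vocab]
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    C[index[w]][index[s[j]]] += 1
    return vocab, C

def sparse_words(sentences, b):
    """Words whose corpus frequency is below the threshold b."""
    freq = Counter(w for s in sentences for w in s)
    return sorted(w for w, n in freq.items() if n < b)

def cosine(u, v):
    """Cosine similarity of two co-occurrence rows (0.0 if either row is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical corpus in which "gud" is a rare misspelling of "good".
corpus = [
    ["good", "morning", "friends"],
    ["good", "evening", "friends"],
    ["gud", "morning", "friends"],
]
vocab, C = cooccurrence(corpus)
rare = sparse_words(corpus, b=2)
# Stand-in similarity matrix: NOT formula (1) of the patent.
S = [[cosine(r1, r2) for r2 in C] for r1 in C]
```

In this toy corpus the rare word "gud" shares most of its contexts with "good", so their co-occurrence rows are close even though "gud" occurs only once; this is the kind of signal a similarity matrix over co-occurrence vectors can supply for sparse words.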
4. The network spoofing detection method of claim 1, wherein, in step 3):
the social information representation and the text information representation learned in steps 1) and 2) are fused by concatenation to generate a fusion vector whose dimensionality is the maximum text length plus one (the additional row representing the user U_p); the first row of the fusion vector is the social information representation of the user who posted the short text, each subsequent row is the representation of a word of the sentence, in order, and if the sentence is shorter than the maximum sentence length, zero vectors are appended to the end of the fusion vector;
the fusion vector is input into the deception text classifier to identify whether the short text is deceptive text.
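The fusion step of claim 4 amounts to stacking one user row on top of the word rows and zero-padding to a fixed height. A minimal Python sketch follows, assuming that the user embedding and the word embeddings share the same dimensionality (the claim does not state this explicitly) and that sentences longer than the maximum length are truncated (also an assumption). The name `fuse` and the sample vectors are hypothetical.

```python
def fuse(user_vec, word_vecs, max_len):
    """Stack the user vector and word vectors into a (max_len + 1)-row matrix."""
    dim = len(user_vec)
    if any(len(v) != dim for v in word_vecs):
        raise ValueError("user and word vectors must share one embedding size")
    # Row 0: social representation of the posting user; rows 1..n: words in order.
    rows = [list(user_vec)] + [list(v) for v in word_vecs[:max_len]]
    # Append zero vectors when the sentence is shorter than max_len.
    rows += [[0.0] * dim for _ in range(max_len + 1 - len(rows))]
    return rows

user_p = [0.1, 0.2, 0.3]                      # hypothetical embedding of user U_p
sentence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical two-word short text
fused = fuse(user_p, sentence, max_len=4)
```

With `max_len=4` the result always has five rows, so every short text reaches the classifier with the same input shape regardless of its length.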
5. The network spoofing detection method of claim 4, wherein the deception text classifier is based on a neural network structure and uses a bidirectional long short-term memory network as the detector; the size of the Input Layer of the classifier is max_l + 1; the Output Layer has k neurons, representing k text classes, and uses the Softmax function as its activation function; and the drop rates of Dropout Layer 1 and Dropout Layer 2 are set to 0.25 and 0.5, respectively.
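Two of the components named in claim 5, dropout at rates 0.25 and 0.5 and a Softmax output over k classes, can be illustrated in plain Python as below. This sketch does not implement the bidirectional LSTM itself, and the claim does not say where the two dropout layers sit relative to the other layers, so the ordering here is an assumption. The "inverted dropout" rescaling is a common convention rather than something the claim specifies, and the feature and logit values are made up.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over k class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
features = [0.5, -1.2, 0.8, 0.3]   # stand-in for features produced by the BiLSTM
h1 = dropout(features, 0.25, rng)  # Dropout Layer 1 (assumed position)
h2 = dropout(h1, 0.5, rng)         # Dropout Layer 2 (assumed position)
probs = softmax([2.0, 0.5])        # Output Layer with k = 2 hypothetical classes
```

At inference time `training=False` turns dropout into the identity, and the Softmax output can be read as one probability per text class.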
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010083486.5A CN111274403B (en) | 2020-02-09 | 2020-02-09 | Network spoofing detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111274403A (en) | 2020-06-12 |
CN111274403B (en) | 2023-04-25 |
Family
ID=71003558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010083486.5A Active CN111274403B (en) | 2020-02-09 | 2020-02-09 | Network spoofing detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111274403B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040064304A1 (en) * | 2002-07-03 | 2004-04-01 | Word Data Corp | Text representation and method |
US20130138428A1 (en) * | 2010-01-07 | 2013-05-30 | The Trustees Of The Stevens Institute Of Technology | Systems and methods for automatically detecting deception in human communications expressed in digital form |
US20190014071A1 (en) * | 2016-10-13 | 2019-01-10 | Tencent Technology (Shenzhen) Company Limited | Network information identification method and apparatus |
GB2572320A (en) * | 2018-03-12 | 2019-10-02 | Factmata Ltd | Hate speech detection system for online media content |
CN110704715A (en) * | 2019-10-18 | 2020-01-17 | 南京航空航天大学 | Cyberbullying detection method and system |
Non-Patent Citations (2)
Title |
---|
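NEKTARIA POTHA et al.: "Cyberbullying Detection using Time Series Modeling", 2014 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOP * |
Zhong Lijun (仲丽君): "Research on Identifying Users Who Post Harmful Speech on Social Networks", China Master's Theses Full-text Database (Information Science and Technology) * |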
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814454A (en) * | 2020-07-10 | 2020-10-23 | 重庆大学 | Multi-modal network spoofing detection model on social network |
CN111814454B (en) * | 2020-07-10 | 2023-08-11 | 重庆大学 | Multi-mode network spoofing detection model on social network |
CN114036366A (en) * | 2021-11-19 | 2022-02-11 | 四川大学 | Social network deception detection method based on text semantics and hierarchical structure |
Also Published As
Publication number | Publication date |
---|---|
CN111274403B (en) | 2023-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11669689B2 (en) | Natural language generation using pinned text and multiple discriminators | |
Ren et al. | A sentiment-aware deep learning approach for personality detection from text | |
US10437936B2 (en) | Generative text using a personality model | |
Agarwal et al. | Fake news detection: an ensemble learning approach | |
Shahi et al. | A hybrid feature extraction method for Nepali COVID-19-related tweets classification | |
Daumé III et al. | A large-scale exploration of effective global features for a joint entity detection and tracking model | |
Zheng et al. | The email author identification system based on Support Vector Machine (SVM) and Analytic Hierarchy Process (AHP) | |
CN112214614B (en) | Knowledge-graph-based risk propagation path mining method and system | |
CN112612871B (en) | Multi-event detection method based on sequence generation model | |
Brocardo et al. | Verifying online user identity using stylometric analysis for short messages | |
CN111274403B (en) | Network spoofing detection method | |
Islam et al. | Deep Learning for Multi-Labeled Cyberbully Detection: Enhancing Online Safety | |
CN113849597A (en) | Illegal advertising word detection method based on named entity recognition | |
CN116049419A (en) | Threat information extraction method and system integrating multiple models | |
Prachi et al. | Detection of Fake News Using Machine Learning and Natural Language Processing Algorithms [J] | |
Ullah et al. | A deep neural network-based approach for sentiment analysis of movie reviews | |
Hofmann et al. | A graph auto-encoder model of derivational morphology | |
Meng et al. | Predicting hate intensity of twitter conversation threads | |
Gôlo et al. | One-class learning for fake news detection through multimodal variational autoencoders | |
Mahajan et al. | EnsMulHateCyb: Multilingual hate speech and cyberbully detection in online social media | |
Zhao et al. | Topic identification of text‐based expert stock comments using multi‐level information fusion | |
Sindhuja et al. | Twitter Sentiment Analysis using Enhanced TF-DIF Naive Bayes Classifier Approach | |
Mathur et al. | Analysis of Tweets for Cyberbullying Detection | |
Nisha et al. | Detection and classification of cyberbullying in social media using text mining | |
Zhang et al. | Sentiment identification by incorporating syntax, semantics and context information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||