CN112699217B - Method for identifying users with abnormal behavior based on user text data and communication data

Info

Publication number: CN112699217B
Application number: CN202011588924.XA
Authority: CN
Inventors: 程鹏飞; 敬好青; 何芳; 刘敏
Original assignee: Xi'an Jiusuo Data Technology Co ltd
Current assignee: Xi'an Jiusuo Data Technology Co ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2023-04-18
Anticipated expiration: 2040-12-29
Also published as: CN112699217A

Abstract

The invention discloses a method for identifying users with abnormal behavior based on user text data and communication data. The method comprises: establishing user text data according to a user's mobile phone number, and filtering the user text data with keywords and expansion words to obtain suspected abnormal texts; when the number of suspected abnormal texts is greater than zero, constructing a user abnormal behavior recognition model based on user text content; when the number of suspected abnormal texts is not greater than zero, constructing an abnormal behavior user identification model based on social network analysis; judging whether the user's behavior is abnormal using the model based on user text content and, if so, entering the user into an abnormal-behavior personnel information base according to the owner information; if not, proceeding to the social-network-analysis model; and judging whether the user's behavior is abnormal using the social-network-analysis model and, if so, entering the user into the abnormal-behavior personnel information base according to the owner information.

Description

Method for identifying users with abnormal behavior based on user text data and communication data

Technical Field

The invention relates to the application of computer technology in the field of public safety, and in particular to a method for identifying users with abnormal behavior based on user text data and communication data.

Background

With the development of deep learning and computer vision, behavior recognition has made significant progress and is widely applied in the field of public security. At present, human behavior recognition mostly extracts features directly from raw video frames and recognizes them with a deep learning network model. This approach carries a large amount of redundant information, which introduces considerable noise into the neural network model, degrades the accuracy and speed of behavior recognition, and leads to many missed detections.

Criminal acts involving terrorism, explosives and viruses seriously endanger social security, damage the national economy, greatly hinder the development and progress of human society, and pose great challenges to social harmony and stability. Analyzing the behavioral characteristics of abnormal users, actively discovering such users and placing them under focused monitoring can effectively prevent socially harmful activities and is of great significance for maintaining social stability.

Disclosure of Invention

The invention aims to provide a method for identifying users with abnormal behavior based on user text data and communication data. The method considers users' text data and communication data together, constructs abnormal user identification models from several aspects, and actively discovers users with abnormal behavior. On the one hand, this allows criminal acts that damage social stability to be targeted precisely and reduces the cost of maintaining social stability; on the other hand, by monitoring users with abnormal behavior, socially harmful activities can be stopped at the source before they occur.

To accomplish this task, the invention adopts the following technical solution:

A method for identifying users with abnormal behavior based on user text data and communication data, characterized by comprising the following steps:

First step: establishing user text data according to the user's mobile phone number, and filtering the user text data with keywords and expansion words to obtain suspected abnormal texts;

Second step: when the number of suspected abnormal texts is greater than zero, constructing a user abnormal behavior recognition model based on user text content;

when the number of suspected abnormal texts is not greater than zero, constructing an abnormal behavior user identification model based on social network analysis;

Third step: judging whether the user's behavior is abnormal using the model based on user text content; if so, entering the user into the abnormal-behavior personnel information base according to the owner information; if not, proceeding to the social-network-analysis model;

Fourth step: judging whether the user's behavior is abnormal using the social-network-analysis model; if so, entering the user into the abnormal-behavior personnel information base according to the owner information.

According to the invention, the construction steps of the user abnormal behavior recognition model based on the user text content are as follows:

Step 1: establishing an abnormal-behavior keyword lexicon $\omega_1$ from common abnormal-behavior words, and using the lexicon $\omega_1$ to preliminarily screen out the suspected abnormal texts $\omega_1$;

Step 2: expanding the abnormal-behavior keywords as root words to establish an expansion lexicon $\omega_2$ of words that contain a keyword but indicate no abnormal behavior, and using the lexicon $\omega_2$ to filter out of the texts $\omega_1$ those that contain words of $\omega_2$, obtaining the suspected abnormal texts $\omega_2$;

Step 3: constructing an abnormal text recognition model based on the BERT pre-trained text classification method. The model consists of five layers: a text input layer, a word embedding layer, a multi-layer Transformer layer, a fully connected layer and a Softmax layer. Each input unit $E_i$ of the text input layer corresponds to the $i$-th token of a sentence sequence in $\omega_2$; the first token of every sequence is the special classification embedding; and the word embedding layer converts each $E_i$ into the corresponding word vector $V_i$.

The word vector $V_i$ is composed as follows:

$$V_i = \text{token embeddings}(E_i) + \text{segmentation embeddings}(E_i) + \text{position embeddings}(E_i)$$

where token embeddings($E_i$) converts tokens to vectors using WordPiece embeddings with a 30,000-token vocabulary; position embeddings($E_i$) is derived from the position of each $E_i$, supports sequences of at most 512 tokens, and thereby limits the sentence length; and segmentation embeddings($E_i$) handles sentence pairs packed into a single sequence: several sentences can be input together, separated by a special token, and in addition a learned sentence A embedding is added to every token of the first sentence and a sentence B embedding to every token of the second sentence. The multi-layer Transformer extracts features from the input word vectors and outputs the contextual feature $T_i$ of each token; $C_i$ denotes the hidden state used to classify the input sentence. A fully connected layer follows $C_i$ to extract class features, and the probability of each class is computed by the Softmax layer, namely $P(C_i) = \mathrm{Softmax}(C_i W)$, where $W$ is the weight of the fully connected layer. The class probabilities are compared, and the text is assigned to the class with the highest probability;

The model training parameters include the learning rate $l_1$, the maximum text length max_length, the number of training epochs (epochs) and the amount of training data per batch (batch_size). Part of the texts $\omega_2$ is extracted as data $\omega_3$, which is labeled by class and used as training data; the model parameters are adjusted while the model is trained on the training set, and the test set $\omega_4$, a subset of $\omega_3$, is used to test the stability and accuracy of the model in order to determine the optimal model parameters;

Step 4: inputting $\omega_2$ into the trained model to label the class of each text in $\omega_2$ and, for each abnormal text, marking the number of the user it belongs to as an abnormal user number;

Step 5: associating the user owner information table to obtain detailed information about the abnormal users, such as certificate numbers and addresses.
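The two-stage screening in steps 1 and 2 can be sketched as follows. This is a minimal illustration only: the function name `screen_suspected_texts`, the plain substring matching and the sample words are assumptions made for the example and are not specified in the patent.

```python
def screen_suspected_texts(texts, keyword_lexicon, expansion_lexicon):
    """Return the texts that survive both screening stages.

    Stage 1 keeps texts containing at least one keyword.
    Stage 2 drops texts containing any benign expansion word.
    Substring matching is an assumption; a real system would tokenize.
    """
    # Stage 1: preliminary screen with the keyword lexicon.
    stage1 = [t for t in texts if any(k in t for k in keyword_lexicon)]
    # Stage 2: remove texts that contain a benign expansion word.
    stage2 = [t for t in stage1
              if not any(e in t for e in expansion_lexicon)]
    return stage2


# Hypothetical sample data, for illustration only.
keywords = ["bomb"]
expansions = ["bath bomb"]
messages = ["bought a bath bomb today",
            "the bomb is ready",
            "see you at lunch"]
suspected = screen_suspected_texts(messages, keywords, expansions)
```

Only texts that survive both stages would be passed on to the classifier of step 3.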

Further, the construction steps of the abnormal behavior user identification model of the social network analysis are as follows:

Step 1: according to the short message data and communication data of the abnormal users, determining the service numbers of the first-degree relation users and second-degree relation users of each abnormal user, counting the number of communications between them, and establishing the input node relation set $S1(P_I \to P_O \to num)$, where $P_I$ is an input node, $P_O$ is the opposite-end output node, and $num$ is the number of communications between the two node users. A first-degree relation user is a user who communicates directly with an abnormal user; a second-degree relation user is a user who has no direct communication with the abnormal user but communicates with one of the abnormal user's first-degree relation users;

Step 2: computing the data labels of all users in S1, including owner name and social attributes;

Step 3: based on the input node relation set $S1(P_I \to P_O \to num)$, establishing the input node relation and node label table S2, which contains the input node service number, input node owner name, input node social attributes, output node service number, output node owner name, output node social attributes and the number of communications;

Step 4: according to the node label table S2, drawing the relation network of the input nodes with a force-directed graph layout, using the number of communications between users as the weight of the edge between their nodes. The attraction between nodes varies with the edge weight: the larger the weight, the stronger the attraction and the more tightly the nodes cluster. All input nodes are traversed to establish the whole abnormal-user social network;

Step 5: constructing an output index to control what the social network displays.
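As an illustration of step 1, the sketch below derives the first-degree and second-degree relation users and the relation set S1 from raw communication records. The record format (pairs of service numbers), the undirected counting and the name `build_relation_set` are assumptions chosen for the example; the patent does not prescribe a data layout.

```python
from collections import Counter


def build_relation_set(records, abnormal_users):
    """records: iterable of (number_a, number_b), one per call or message.

    Returns (s1, first_degree, second_degree), where s1 is a sorted list
    of (p_i, p_o, num) rows with num the number of communications.
    """
    abnormal = set(abnormal_users)
    neighbors = {}
    for a, b in records:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # First degree: direct contacts of any abnormal user.
    first = set()
    for u in abnormal:
        first |= neighbors.get(u, set())
    first -= abnormal

    # Second degree: contacts of first-degree users, excluding the above.
    second = set()
    for u in first:
        second |= neighbors.get(u, set())
    second -= abnormal | first

    core = abnormal | first
    keep = core | second
    counts = Counter()
    for a, b in records:
        # Keep an edge only if it touches an abnormal or first-degree user.
        if a in keep and b in keep and (a in core or b in core):
            counts[tuple(sorted((a, b)))] += 1
    s1 = sorted((p_i, p_o, num) for (p_i, p_o), num in counts.items())
    return s1, first, second


# Hypothetical records, for illustration only.
calls = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("A", "B")]
s1, first_degree, second_degree = build_relation_set(calls, ["A"])
```

Here user D is dropped because D is three hops from the abnormal user A, so the edge between C and D falls outside the first-degree and second-degree scope.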

Specifically, the output index that controls what the social network displays is constructed as follows:

Step 1: collecting the opposite-end nodes of S1 and removing the input nodes among them to form the opposite-end node set S3;

Step 2: calculating the number of friends of each input node and of each opposite-end node, where the number of friends of a user is defined as the number of users communicating with that user;

Step 3: removing from the opposite-end node set S3 the nodes whose input-node friend number is 1 to form the candidate node set S4;

Step 4: counting the input-node friend numbers of all candidate nodes to form the candidate node statistics table S5, which contains the input-node friend number of the candidate nodes, the number of candidate nodes and the cumulative ratio, sorted by input-node friend number from large to small;

Step 5: using the cumulative ratio as the output index for displaying the social network: when the output index is 0, only the input nodes are displayed, and when the output index is 1, the input nodes and all candidate nodes are displayed.
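The candidate table S5 and the output index can be sketched as below. The interpretation that a candidate is displayed when the cumulative ratio of its row does not exceed the output index is an assumption, as are the function names and the tuple layout of S1.

```python
from collections import Counter


def candidate_table(s1, input_nodes):
    """Build S4 (candidate -> input-node friend number) and S5.

    s1 rows are (p_i, p_o, num). S5 rows are
    (input_friend_number, candidate_count, cumulative_ratio),
    sorted by input_friend_number from large to small.
    """
    inputs = set(input_nodes)
    friends = {}
    for p_i, p_o, _num in s1:
        for a, b in ((p_i, p_o), (p_o, p_i)):
            # S3: opposite-end nodes that are not input nodes themselves.
            if a in inputs and b not in inputs:
                friends.setdefault(b, set()).add(a)
    # S4: drop nodes that talk to only one input node.
    s4 = {n: len(f) for n, f in friends.items() if len(f) > 1}
    s5, running, total = [], 0, len(s4)
    for k, count in sorted(Counter(s4.values()).items(), reverse=True):
        running += count
        s5.append((k, count, running / total))
    return s4, s5


def visible_nodes(s4, s5, input_nodes, output_index):
    """Input nodes plus candidates whose row ratio is within the index."""
    allowed = {k for k, _count, ratio in s5 if ratio <= output_index}
    return set(input_nodes) | {n for n, k in s4.items() if k in allowed}


# Hypothetical relation set, for illustration only.
rows = [("A", "X", 2), ("B", "X", 1), ("C", "X", 4),
        ("A", "Y", 1), ("B", "Y", 1), ("A", "Z", 5), ("A", "B", 3)]
s4, s5 = candidate_table(rows, ["A", "B", "C"])
```

Raising the output index from 0 to 1 progressively reveals candidates, starting with those connected to the most input nodes.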

Preferably, the abnormal user identification flow comprises the following steps:

Step 1: inputting the user's mobile phone number, extracting and analyzing all text data of the user within a time range, and using the abnormal-behavior lexicon $\omega_1$ and the expansion lexicon $\omega_2$ to screen out the suspected abnormal texts $\omega_2$;

Step 2: judging the number of suspected abnormal texts $\omega_2$ and selecting the recognition model accordingly: if the number of texts in $\omega_2$ is greater than zero, the abnormal user identification model based on text content is selected; otherwise, the abnormal user identification method based on social network analysis is selected to judge whether the user behaves abnormally.
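The model selection described above, together with the fallback from the text model to the social network model in the third and fourth steps, can be sketched as a small dispatcher. The two models are passed in as plain callables; their names and signatures are hypothetical stand-ins for the trained classifier and the network analysis.

```python
def identify_user(phone_number, suspected_texts, text_model, network_model):
    """Return (is_abnormal, route) for one user.

    text_model(text) -> bool and network_model(phone_number) -> bool are
    assumed interfaces, not part of the patent text.
    """
    if len(suspected_texts) > 0:
        # Text route: any text classified as abnormal flags the user.
        if any(text_model(t) for t in suspected_texts):
            return True, "text"
    # No suspected text, or the text model found nothing:
    # fall through to the social network analysis.
    return bool(network_model(phone_number)), "network"
```

For example, `identify_user("n1", [], text_model, network_model)` skips the text model entirely because the set of suspected texts is empty.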

Based on the social network of users with abnormal behavior, a network analysis algorithm is used to screen out of the network the key persons, that is, other users with abnormal behavior, as follows:

Step 1: K-shell value calculation

Computing the K-shell value of every node in the network formed by all nodes, and tabulating the values in an all-node statistical table that contains the K-shell value, the number of nodes and the cumulative ratio. The cumulative ratio serves as the display index (core index): when it is 0, no node is displayed, and when it is 1, all nodes are displayed;

The K-shell value is computed as follows: first, all nodes of degree 1 are found and deleted; the remaining nodes are then searched again for nodes of degree 1, which are deleted in turn, until no node of degree 1 is left in the network; all nodes deleted in this way are assigned K-shell value 1, that is, KS = 1; nodes of higher degree are then processed in the same manner;

Step 2: betweenness centrality calculation

1) Selecting a source node $s$, performing a breadth-first search and computing $\sigma_{st}$, the number of shortest paths from $s$ to $t$;

2) During the traversal, recording for each node its set of predecessors, defined as

$$P_s(v) = \{\, u \in V : \{u, v\} \in E,\ d(s, v) = d(s, u) + d(u, v) \,\}$$

that is, if a shortest path from $s$ to $v$ contains the edge $(u, v)$, then $u$ belongs to the predecessor set of $v$; here $V$ is the set of all nodes in the network, $E$ is the set of edges (node pairs), and $d(s, v)$ is the minimum number of nodes traversed from $s$ to $v$, i.e. the shortest-path distance;

3) Computing the betweenness centrality

$$\delta_s(v) = \sum_{w:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left(1 + \delta_s(w)\right)$$

where $v$ is a predecessor of $w$ on a shortest path from $s$ to $w$. Betweenness centrality reflects a node's capacity to relay information: the higher a node's betweenness centrality, the more important its position in the network;

Step 3: screening for users with abnormal behavior by adjusting the K-shell value, the betweenness centrality value, the output index and the node degree to filter the social network; a user whose node remains in the filtered network is a user with abnormal behavior.
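The peeling procedure of step 1 corresponds to the standard k-shell decomposition, which can be sketched as below. The adjacency-dictionary input and the function name `k_shell` are assumptions; isolated nodes receive shell value 0 in this sketch, a case the text does not cover.

```python
def k_shell(adjacency):
    """adjacency: dict mapping node -> set of neighbours (undirected).

    Returns a dict mapping node -> K-shell value (KS).
    """
    adj = {n: set(nb) for n, nb in adjacency.items()}  # working copy
    shell = {}
    k = 0
    while adj:
        # The next shell index is the smallest remaining degree.
        k = max(k, min(len(nb) for nb in adj.values()))
        while True:
            # Deleting nodes can lower other degrees, so repeat the pass.
            peel = [n for n, nb in adj.items() if len(nb) <= k]
            if not peel:
                break
            for n in peel:
                shell[n] = k
                for m in adj.pop(n):
                    if m in adj:
                        adj[m].discard(n)
    return shell


# Triangle A-B-C with a tail A-D-E, for illustration only.
graph = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"},
         "D": {"A", "E"}, "E": {"D"}}
shells = k_shell(graph)
```

In this example the tail nodes D and E are peeled first (KS = 1) and the triangle forms the inner shell (KS = 2).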

The method for identifying users with abnormal behavior based on user text data and communication data uses a network analysis algorithm to mine from the network the key persons, that is, other users with abnormal behavior, and associates the user owner information table to obtain detailed information about those users. Text data and communication data of users are considered together, abnormal user identification models are constructed from several aspects, and users with abnormal behavior are actively discovered. On the one hand, criminal acts that damage social stability can be targeted precisely and the cost of maintaining social stability is reduced; on the other hand, by monitoring users with abnormal behavior, socially harmful activities can be stopped at the source before they occur.

Drawings

FIG. 1 is a flow chart of a method for identifying a user with abnormal behavior based on user text data and communication data according to the present invention;

FIG. 2 is a flow chart of building a user abnormal behavior recognition model based on user text content;

FIG. 3 is a diagram of an abnormal text recognition model structure;

FIG. 4 is a schematic diagram of an anomalous user social network.

The present invention will be described in further detail with reference to the following drawings and examples.

Detailed Description

Referring to fig. 1, the embodiment provides a method for identifying a user with abnormal behavior based on user text data and communication data, which specifically includes the following steps:

First step: establishing user text data according to the user's mobile phone number, and filtering the user text data with keywords and expansion words to obtain suspected abnormal texts;

Second step: when the number of suspected abnormal texts is greater than zero, constructing a user abnormal behavior recognition model based on user text content;

when the number of suspected abnormal texts is not greater than zero, constructing an abnormal behavior user identification model based on social network analysis;

1. Constructing the user abnormal behavior recognition model based on user text content, as shown in FIG. 2, comprises the following steps:

Step 2: expanding the abnormal-behavior keywords as root words to establish an expansion lexicon $\omega_2$ of words that contain a keyword but indicate no abnormal behavior, and using the lexicon $\omega_2$ to filter out of the texts $\omega_1$ those that contain words of $\omega_2$, obtaining the suspected abnormal texts $\omega_2$;

Step 3: constructing an abnormal text recognition model based on the BERT pre-trained text classification method (for the structure, see FIG. 3). The model consists of five layers: a text input layer, a word embedding layer, a multi-layer Transformer layer, a fully connected layer and a Softmax layer. Each input unit $E_i$ of the text input layer corresponds to the $i$-th token of a sentence sequence in $\omega_2$; the first token of every sequence is the special classification embedding; and the word embedding layer converts each $E_i$ into the corresponding word vector $V_i$. The word vector $V_i$ is composed as follows:

$$V_i = \text{token embeddings}(E_i) + \text{segmentation embeddings}(E_i) + \text{position embeddings}(E_i)$$

where token embeddings($E_i$) converts tokens to vectors using WordPiece embeddings with a 30,000-token vocabulary; position embeddings($E_i$) is derived from the position of each $E_i$, supports sequences of at most 512 tokens, and thereby limits the sentence length; and segmentation embeddings($E_i$) handles sentence pairs packed into a single sequence: several sentences can be input together, separated by a special token, and in addition a learned sentence A embedding is added to every token of the first sentence and a sentence B embedding to every token of the second sentence. The multi-layer Transformer extracts features from the input word vectors and outputs the contextual feature $T_i$ of each token; $C_i$ denotes the hidden state used to classify the input sentence. A fully connected layer follows $C_i$ to extract class features, and the probability of each class is computed by the Softmax layer, namely $P(C_i) = \mathrm{Softmax}(C_i W)$, where $W$ is the weight of the fully connected layer. The class probabilities are compared, and the text is assigned to the class with the highest probability;

The model training parameters include the learning rate $l_1$, the maximum text length max_length, the number of training epochs (epochs) and the amount of training data per batch (batch_size). Part of the texts $\omega_2$ is extracted as data $\omega_3$, which is labeled by class and used as training data; the model parameters are adjusted while the model is trained on the training set, and the test set $\omega_4$ (a subset of $\omega_3$) is used to test the stability and accuracy of the model in order to determine the optimal model parameters.

Step 4: inputting $\omega_2$ into the trained model to label the class of each text in $\omega_2$ and, for each abnormal text, marking the number of the user it belongs to as an abnormal user number.
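The embedding sum and the Softmax classification head of step 3 can be illustrated with a few lines of plain Python. This is a toy numeric sketch under stated assumptions, not the BERT implementation: vectors are short Python lists, the Transformer layers are omitted, and the names `input_vector`, `softmax` and `classify` are hypothetical.

```python
import math


def input_vector(token_emb, segment_emb, position_emb):
    """V_i as the element-wise sum of the three embeddings of E_i."""
    return [t + s + p for t, s, p in zip(token_emb, segment_emb, position_emb)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(c, w):
    """P(C) = Softmax(C W): c has length H, w is an H x K weight matrix.

    Returns (index of the most probable class, class probabilities).
    """
    n_classes = len(w[0])
    logits = [sum(c[h] * w[h][j] for h in range(len(c)))
              for j in range(n_classes)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


# Toy values, for illustration only.
v = input_vector([1.0, 2.0], [0.5, 0.5], [0.0, 1.0])
label, probs = classify([1.0, 0.0], [[0.0, 2.0], [1.0, 0.0]])
```

In practice the hidden state would come from the multi-layer Transformer and `w` would be learned during the fine-tuning described in the training paragraph.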

2. Constructing an abnormal behavior user identification model of social network analysis, comprising the following steps:

Step 1: according to the short message data and communication data of the abnormal users, determining the service numbers of the first-degree relation users and second-degree relation users of each abnormal user, counting the number of communications between them, and establishing the input node relation set $S1(P_I \to P_O \to num)$, where $P_I$ is an input node, $P_O$ is the opposite-end output node, and $num$ is the number of communications between the two node users. A first-degree relation user is a user who communicates directly with an abnormal user; a second-degree relation user is a user who has no direct communication with the abnormal user but communicates with one of the abnormal user's first-degree relation users;

Step 2: computing the data labels of all users in S1, including owner names, social attributes and the like;

Step 3: based on S1, establishing the input node relation and node label table S2, which contains the input node service number, input node owner name, input node social attributes, output node service number, output node owner name, output node social attributes and the number of communications;

Step 4: according to S2, drawing the relation network of the input nodes with a force-directed graph layout, using the number of communications between users as the weight of the edge between their nodes. The attraction between nodes varies with the edge weight: the larger the weight, the stronger the attraction and the more tightly the nodes cluster. All input nodes are traversed to establish the whole abnormal-user social network, as shown in FIG. 4.

Step 5: constructing an output index to control what the social network displays, specifically:

step 5.1: summarizing the opposite ends of the S1, removing input nodes therein, and forming an opposite end node set S3;

step 5.2: calculating the number of friends of each input node and the number of friends of an opposite end node, wherein the number of friends of a user is defined as the number of users communicating with the user;

step 5.3: removing nodes with the input node friend number of 1 in the opposite-end node set S3 to form a candidate node set S4;

step 5.4: counting the input node friend numbers of all candidate nodes to form a candidate node counting table S5, wherein the candidate node counting table comprises the input node friend numbers of the candidate nodes, the candidate node numbers and the accumulated ratio, and the input node friend numbers are sorted from large to small;

step 5.5: and using the accumulated ratio as a specific output index of the display of the social network, when the output index is 0, only the input nodes are displayed, and when the output index is 1, the input nodes and all the candidate nodes are displayed.
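For the node label table S2 of step 3 in this section, a minimal sketch is given below. It joins each row of S1 with owner information; the dictionary field names (`name`, `attribute`) and the shape of `owner_info` are assumptions made for the example.

```python
def build_label_table(s1, owner_info):
    """Join S1 rows (p_i, p_o, num) with owner labels to form S2.

    owner_info maps a service number to a dict with the hypothetical
    keys "name" and "attribute"; unknown numbers yield None labels.
    """
    rows = []
    for p_i, p_o, num in s1:
        src = owner_info.get(p_i, {})
        dst = owner_info.get(p_o, {})
        rows.append({
            "input_number": p_i,
            "input_owner": src.get("name"),
            "input_attribute": src.get("attribute"),
            "output_number": p_o,
            "output_owner": dst.get("name"),
            "output_attribute": dst.get("attribute"),
            "num": num,  # number of communications, used as edge weight
        })
    return rows
```

Each resulting row carries everything the force-directed drawing needs: two labeled endpoints and the communication count as the edge weight.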

3. Based on the social network of the abnormal users, a key-person analysis method is established to screen other abnormal users out of the network, comprising the following steps:

Step 1: K-shell value calculation

Computing the K-shell value of every node in the network formed by all nodes, and tabulating the values in an all-node statistical table that contains the K-shell value, the number of nodes and the cumulative ratio. The cumulative ratio serves as the display index (core index): when it is 0, no node is displayed, and when it is 1, all nodes are displayed. The K-shell value is computed as follows: first, all nodes of degree 1 are found and deleted; the remaining nodes are then searched again for nodes of degree 1, which are deleted in turn, until no node of degree 1 is left in the network. All nodes deleted in this way are assigned K-shell value 1, that is, KS = 1; nodes of higher degree are then processed in the same manner.

Step 2: betweenness centrality calculation

Step 2.1: selecting a source node $s$, performing a breadth-first search and computing $\sigma_{st}$, the number of shortest paths from $s$ to $t$;

Step 2.2: during the traversal, recording for each node its set of predecessors, defined as

$$P_s(v) = \{\, u \in V : \{u, v\} \in E,\ d(s, v) = d(s, u) + d(u, v) \,\}$$

that is, if a shortest path from $s$ to $v$ contains the edge $(u, v)$, then $u$ belongs to the predecessor set of $v$; here $V$ is the set of all nodes in the network, $E$ is the set of edges (node pairs), and $d(s, v)$ is the minimum number of nodes traversed from $s$ to $v$, i.e. the shortest-path distance;

Step 2.3: computing the betweenness centrality

$$\delta_s(v) = \sum_{w:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left(1 + \delta_s(w)\right)$$

where $v$ is a predecessor of $w$ on a shortest path from $s$ to $w$. Betweenness centrality reflects a node's capacity to relay information: the higher a node's betweenness centrality, the more important its position in the network.
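Steps 2.1 to 2.3 match the structure of Brandes' algorithm for betweenness centrality on unweighted graphs, sketched below. The adjacency-dictionary input, the function name and the halving for undirected graphs are assumptions of this sketch; every neighbour is expected to appear as a key of the dictionary.

```python
from collections import deque


def betweenness(adjacency):
    """adjacency: dict mapping node -> set of neighbours (undirected)."""
    centrality = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Step 2.1: BFS from s, counting shortest paths sigma[s -> t].
        sigma = {v: 0 for v in adjacency}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adjacency}  # Step 2.2: predecessor sets
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Step 2.3: accumulate dependencies from the farthest nodes back.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


# Path A - B - C, for illustration only: B lies on the only A-C path.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
scores = betweenness(path)
```

The back-propagation loop is a direct transcription of the dependency formula, applied in order of decreasing distance from the source.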

4. The abnormal user identification flow comprises the following steps:

Step 2: judging the number of suspected abnormal texts $\omega_2$ and selecting the recognition model accordingly. If the number of texts in $\omega_2$ is greater than zero, the abnormal user identification model based on text content is selected; otherwise, the abnormal user identification method based on social network analysis is selected to judge whether the user behaves abnormally.

Claims

1. A method for identifying users with abnormal behavior based on user text data and communication data, characterized by comprising the following steps:

the user abnormal behavior recognition model based on user text content is constructed by the following steps:

Step 2: expanding the abnormal-behavior keywords as root words to establish an expansion lexicon $\omega_2$ of words that contain a keyword but indicate no abnormal behavior, and using the lexicon $\omega_2$ to filter out of the texts $\omega_1$ those that contain words of $\omega_2$, obtaining the suspected abnormal texts $\omega_2$;

The word vector $V_i$ is composed as follows:

$$V_i = \text{token embeddings}(E_i) + \text{segmentation embeddings}(E_i) + \text{position embeddings}(E_i)$$

where token embeddings($E_i$) converts tokens to vectors using WordPiece embeddings with a 30,000-token vocabulary; position embeddings($E_i$) is derived from the position of each $E_i$, supports sequences of at most 512 tokens, and thereby limits the sentence length; and segmentation embeddings($E_i$) handles sentence pairs packed into a single sequence: several sentences can be input together, separated by a special token, and in addition a learned sentence A embedding is added to every token of the first sentence and a sentence B embedding to every token of the second sentence; the multi-layer Transformer extracts features from the input word vectors and outputs the contextual feature $T_i$ of each token; $C_i$ denotes the hidden state used to classify the input sentence; a fully connected layer follows $C_i$ to extract class features, and the probability of each class is computed by the Softmax layer, namely $P(C_i) = \mathrm{Softmax}(C_i W)$, where $W$ is the weight of the fully connected layer; the class probabilities are compared, and the text is assigned to the class with the highest probability;

wherein: the learned sentence A refers to a sentence A among the sentences that have been learned;

sentence B refers to a sentence B that has not been learned;

the model training parameters include the learning rate $l_1$, the maximum text length max_length, the number of training epochs (epochs) and the amount of training data per batch (batch_size); part of the texts $\omega_2$ is extracted as data $\omega_3$, which is labeled by class and used as training data; the model parameters are adjusted while the model is trained on the training set, and the test set $\omega_4$, a subset of $\omega_3$, is used to test the stability and accuracy of the model in order to determine the optimal model parameters;

Step 5: associating the user owner information table to obtain the certificate numbers and addresses of the abnormal users;

the method for constructing the abnormal behavior user identification model of the social network analysis comprises the following steps:

Step 1: according to the short message data and communication data of the abnormal users, determining the service numbers of the first-degree relation users and second-degree relation users of each abnormal user, counting the number of communications between them, and establishing the input node relation set $S1(P_I \to P_O \to num)$, where $P_I$ is an input node, $P_O$ is the opposite-end output node, and $num$ is the number of communications between the two node users; a first-degree relation user is a user who communicates directly with an abnormal user, and a second-degree relation user is a user who has no direct communication with the abnormal user but communicates with one of the abnormal user's first-degree relation users;

Step 3: based on the input node relation set $S1(P_I \to P_O \to num)$, establishing the input node relation and node label table S2, which contains the input node service number, input node owner name, input node social attributes, output node service number, output node owner name, output node social attributes and the number of communications;

Step 4: according to the node label table S2, drawing the relation network of the input nodes with a force-directed graph layout, using the number of communications between users as the weight of the edge between their nodes; the attraction between nodes varies with the edge weight: the larger the weight, the stronger the attraction and the more tightly the nodes cluster; all input nodes are traversed to establish the whole abnormal-user social network;

Step 5: constructing an output index to control what the social network displays;

2. The method of claim 1, wherein constructing the output index to control what the social network displays comprises:

Step 5: using the cumulative ratio as the output index for displaying the social network: when the output index is 0, only the input nodes are displayed, and when the output index is 1, the input nodes and all candidate nodes are displayed.

3. The method of claim 1, wherein the first step and the second step specifically comprise the following steps:

Step 2: judging the number of suspected abnormal texts $\omega_2$ and selecting the recognition model accordingly: if the number of texts in $\omega_2$ is greater than zero, the abnormal user identification model based on text content is selected; otherwise, the abnormal user identification method based on social network analysis is selected to judge whether the user behaves abnormally.

4. The method of claim 1, wherein the fourth step specifically comprises the following steps:

Step 1: K-shell value calculation

the K-shell value is computed as follows: first, all nodes of degree 1 are found and deleted; the remaining nodes are then searched again for nodes of degree 1, which are deleted in turn, until no node of degree 1 is left in the network; all nodes deleted in this way are assigned K-shell value 1, that is, KS = 1; nodes of higher degree are then processed in the same manner;

Step 2: betweenness centrality calculation

2) During the traversal, recording for each node its set of predecessors, defined as

$$P_s(v) = \{\, u \in V : \{u, v\} \in E,\ d(s, v) = d(s, u) + d(u, v) \,\}$$

that is, if a shortest path from $s$ to $v$ contains the edge $(u, v)$, then $u$ belongs to the predecessor set of $v$; here $V$ is the set of all nodes in the network, $E$ is the set of edges (node pairs), and $d(s, v)$ is the minimum number of nodes traversed from $s$ to $v$;

$d(s, u)$ denotes the distance from $s$ to $u$, and $d(u, v)$ denotes the distance from $u$ to $v$;

$\sum$ denotes summation, $P_s(w)$ denotes the set of predecessor nodes of $w$ on the shortest paths from $s$ to $w$, $\sigma_{sv}$ denotes the number of shortest paths from $s$ to $v$, $\sigma_{sw}$ denotes the number of shortest paths from $s$ to $w$, and $\delta_s(v)$ denotes the betweenness centrality of $v$;