CN117312551A - Social text classification method, system, computer equipment and storage medium - Google Patents
Social text classification method, system, computer equipment and storage medium
- Publication number
- Publication number: CN117312551A; Application number: CN202310930896.2A
- Authority
- CN
- China
- Prior art keywords
- text
- user
- graph
- social
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a social text classification method based on a graph convolutional network, which comprises the following steps: acquiring social text data, where the social text data includes users and the text content of each user; processing the text content of each user with a trained BERT-attention model to obtain a user text vector; constructing a user association graph with the user text vectors as nodes and the number of texts exchanged between users as edges; performing a graph convolution operation based on the user association graph to obtain an associated text vector; and obtaining a classification label for the user's text content based on the associated text vector and the user text vector. The scheme of the invention takes BERT and the graph convolutional neural network as its technical basis: it focuses not only on a person's chat text but mines and models information from the person's overall social content, quantifies the associations between people, and thereby improves the accuracy of social text classification.
Description
Technical Field
The application belongs to the technical field of natural language processing, and particularly relates to a social text classification method, system, computer device, and storage medium based on a graph convolutional network.
Background
User tag classification refers to the process of tagging a user according to aspects such as behavior patterns, speech content, and profile data. In the prior art, user tag classification methods fall into three kinds. The first is user habit analysis based on recommendation algorithms, such as matrix factorization, factorization machines, and the Deep Cooperative Neural Network (DeepCoNN); these can build portraits of consumption and browsing habits in fields such as e-commerce and short video. The second is persona analysis based on keyword extraction, such as the LDA topic word extraction model, the tf-idf keyword extraction algorithm, and BiLSTM-based deep learning keyword extraction models. The third is user portrait construction based on text classification models, for example labeling a user's chat text with TextCNN, TextRNN, or Transformer models and classifying the person according to the label results. In practical application, however, these schemes still suffer from problems such as inaccurate classification.
Disclosure of Invention
In view of the above problems, a first aspect of the present invention provides a social text classification method based on a graph convolutional network, comprising the steps of: acquiring social text data, the social text data comprising users and the text content of each user; calculating over the text content of each user to obtain a user text vector; constructing a user association graph with the user text vectors as nodes and the number of texts exchanged between users as edges; performing a graph convolution operation based on the user association graph to obtain an associated text vector; and obtaining a classification label for the user's text content based on the associated text vector.
Preferably, a trained BERT-attention model is applied to obtain the user text vector.
Preferably, the input of the BERT-attention model is a sentence set X, whose construction comprises the steps of: performing a primary classification of the user's text content; extracting a certain number of sentences from each category to form the sentence set X, where the number of sentences extracted from a category is proportional to the ratio of the number of sentences in that category to the total number of sentences in the text content.
Preferably, the BERT-attention model encodes the input text content as sentence vectors and applies a self-attention mechanism to compute a weighted sum of the sentence vectors, obtaining the user text vector.
Preferably, edges of the user association graph are built only between users whose number of exchanged texts is greater than a threshold.
Preferably, the classification labels of the text content of the user are obtained based on the associated text vector and the user text vector.
Preferably, the calculation of the classification label of the text content of the user specifically comprises the steps of: splicing the associated text vector and the user text vector to obtain a spliced vector; and classifying the spliced vectors to obtain classification labels of the text contents of the users.
The second aspect of the present invention proposes a social text classification system based on a graph convolutional network, comprising:
the data crawling module is configured to acquire social text data, wherein the social text data comprises a user and text content of the user;
the text content classification module is configured to calculate the text content of each user to obtain a user text vector;
the user association diagram construction module is configured to construct a user association diagram by taking the text vectors of the users as nodes and the quantity of text contents sent among the users as edges;
the graph convolution module, configured to perform a graph convolution operation based on the user association graph to obtain an associated text vector;
and the user portrait module is configured to obtain a classification label of the text content of the user based on the associated text vector.
A third aspect of the invention proposes a computer device comprising a memory storing a computer program and a processor implementing the method according to any of the first aspects when the processor executes the computer program.
A fourth aspect of the invention proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method according to any one of the first aspects.
The scheme of the invention, based on BERT and graph convolutional neural network technology, fully mines the information in the social language of people on the network and assigns labels to them. Compared with existing user-portrait label division techniques based on text classification models, the method does not focus only on a person's chat text: it mines and models information from the person's overall social content, quantifies the associations between people, and, during modeling, adds the chat content of the associated users connected to a user and quantifies it together with the user's own text content. This yields a person-portrait method for judging categories and improves the accuracy of social text classification.
Drawings
The accompanying drawings assist in a further understanding of the present application. The elements of the drawings are not necessarily to scale relative to each other. For convenience of description, only parts related to the related invention are shown in the drawings.
FIG. 1 is a step diagram of a social text classification method based on graph convolutional network in an embodiment of the invention;
FIG. 2 is a diagram of a social text classification technique framework based on graph convolutional networks in accordance with an embodiment of the invention;
FIG. 3 is a diagram of a social text classification model framework based on a graph convolutional network in accordance with an embodiment of the present invention;
FIG. 4 is a diagram of a social text classification system framework based on a graph convolutional network in accordance with another embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained neural network model developed by Google. It employs the Transformer architecture and uses a bidirectional encoder to generate context-dependent word vector representations. BERT achieves state-of-the-art performance on multiple natural language processing tasks, including question answering, sentiment analysis, and named entity recognition. The training process of the BERT model comprises two phases: pre-training and fine-tuning. In the pre-training phase, BERT learns generic language representations from massive amounts of unlabeled text data, which can be applied to various natural language processing tasks. In the fine-tuning stage, BERT is fine-tuned on domain-specific labeled data by supervised learning to improve its performance.
The main technique of the invention is as follows: first, a base classification model is trained with the BERT model to perform category judgment and sentence vector conversion on chat text; second, taking each chat person as a unit, the chat text of that person and of their contacts is mined and analyzed with self-attention and a graph convolutional neural network (GCN); finally, category labels are assigned to the person from the mining and analysis results.
Fig. 1 is a step diagram of a social text classification method based on a graph convolutional network in an embodiment, and fig. 2 is a frame diagram of a social text classification technology based on the graph convolutional network in the embodiment. The method specifically comprises the following steps:
s1, acquiring social text data, wherein the social text data comprise users and text contents of the users.
Typically, data may be obtained from social software such as microblogs or QQ by a web crawler or similar means. Specifically, in this embodiment, group chat data is acquired by a web crawler. The crawled data is divided by chat content and chat person to obtain a person set P = {p_1, p_2, ..., p_m} and a text set S = {s_1, s_2, ..., s_n}, where m represents the number of chat persons and n the number of chat texts. Examples of the data after cleaning into a standardized format are shown in Table 1.
Table 1 group chat data format examples
Generally, because the chat content contains interfering network-slang vocabulary, data cleaning preprocessing must first be performed on the chat text. The preprocessing comprises the following steps: 1. use regular expressions to filter interfering text such as mobile phone numbers, ID card numbers, chat prompt sentences, and web page links; 2. filter out emoticons and stop words using collected emoticon dictionaries and stop-word dictionaries; 3. use a collected social-text domain dictionary as the user dictionary and perform word segmentation on the chat content with the jieba segmentation tool.
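The cleaning steps above can be sketched as follows (an illustrative Python sketch; the stop-word list and regular expressions are assumptions, and the real pipeline would additionally apply jieba segmentation with the domain dictionary):

```python
import re

# Illustrative stop words; the patent assumes collected emoticon/stop-word dictionaries.
STOP_WORDS = {"的", "了", "呢", "吧"}

def clean_chat_text(text: str) -> str:
    """Regex-based cleaning: strip web links, mobile/ID numbers, and
    bracketed emoticons such as [微笑], as described in steps 1-2."""
    text = re.sub(r"https?://\S+", "", text)      # web page links
    text = re.sub(r"\b1\d{10}\b", "", text)       # 11-digit mobile numbers
    text = re.sub(r"\b\d{17}[\dXx]\b", "", text)  # 18-digit ID card numbers
    text = re.sub(r"\[[^\[\]]{1,8}\]", "", text)  # bracketed emoticons
    return text.strip()

# Step 3 would segment with jieba and a domain user dictionary; here a plain
# whitespace split stands in so the sketch stays dependency-free.
tokens = [t for t in clean_chat_text("看这个 https://a.b/c [微笑] 13912345678 不错").split()
          if t not in STOP_WORDS]
```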
S2, calculating the text content of each user to obtain a user text vector.
A portion of the chat content is extracted from the preprocessed data to form a data set, and the extracted data set is labeled. The labeled data set is divided into training, test, and validation sets for training the classification model. The invention uses the BERT model to train the base classification model, training a text classification model on the labeled data set.
In this embodiment, the user text vector is synthesized based on the BERT-attention model. The user text vector is a vector quantized from all chat content of a chat user, and its synthesis proceeds as follows:
first, the user's chat content is text-classified using the BERT model that has been trained.
Secondly, the number m of sentences to extract is set, and m items are extracted from the chat text in proportion to category frequency to form a sentence set X = {x_1, x_2, ..., x_m}.
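The proportional extraction can be sketched as follows (an illustrative Python sketch; the rounding scheme and the minimum quota of one sentence per category are assumptions not specified in the patent):

```python
import random

def sample_by_category(sentences, labels, m, seed=0):
    """Draw about m sentences, with each category's quota proportional to its
    share of all sentences in the chat text."""
    rng = random.Random(seed)
    by_cat = {}
    for s, c in zip(sentences, labels):
        by_cat.setdefault(c, []).append(s)
    total = len(sentences)
    picked = []
    for group in by_cat.values():
        k = max(1, round(m * len(group) / total))  # proportional quota
        picked.extend(rng.sample(group, min(k, len(group))))
    return picked[:m]

# 8 sentences of category "a" and 2 of category "b" -> quotas 4 and 1 for m = 5.
X = sample_by_category([f"s{i}" for i in range(10)], ["a"] * 8 + ["b"] * 2, m=5)
```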
Thirdly, the trained BERT classifier encodes the input sentences into sentence vectors, and a self-attention weighted summation over the sentence vectors yields the personal text vector v. The specific formulas are as follows:

H = f(X), α = softmax(tanh(H W_t + b_t)), v = ∑_i α_i H_i

where f represents the generation function of the BERT model, H represents the matrix of concatenated sentence vectors, W_t and b_t are the weight matrix and bias term used in the attention calculation, and v, the personal comprehensive text vector (personal comprehensive vector), is a comprehensive vector representation of the target user's chat text.
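A minimal NumPy sketch of the self-attention weighted summation over sentence vectors (random vectors stand in for real BERT outputs, and the tanh(H W_t + b_t) scoring form is an assumption consistent with the description):

```python
import numpy as np

def attention_pool(H, W_t, b_t):
    """alpha = softmax(tanh(H @ W_t + b_t)); v = sum_i alpha_i * H_i."""
    scores = np.tanh(H @ W_t + b_t)   # (m, 1) per-sentence attention scores
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()       # softmax over the m sentences
    return (alpha * H).sum(axis=0)    # weighted sum -> user text vector v

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))           # 5 sentence vectors of dimension 8
v = attention_pool(H, rng.normal(size=(8, 1)), 0.0)
```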
S3, constructing a user association diagram by taking the user text vector as a node and taking the number of text contents transmitted among users as edges.
Social text is a form of communication and interaction between people, and a person's influence in social text comes not only from that person but also from their nearby contacts. The graph convolutional neural network (GCN) is a deep learning model for capturing information between nodes in graph form. In this embodiment, chat content serves as the connecting edges and the user text vector of each chat user as the node vector to build the graph convolutional neural network.
The specific steps for constructing the user association graph comprise: taking the user text vector of each chat user as a node, and constructing connecting edges from the number of texts sent between that chat user and other chat users. To reduce the construction of invalid edges, the invention sets a threshold number m: an edge is constructed only if the number of texts sent between two users is greater than m. The constructed adjacency matrix A_ij is given by:

A_ij = n_ij / ∑_j n_ij if n_ij > m, and A_ij = 0 otherwise

where n_ij represents the number of sentences sent between user i and user j, and A_ij is the normalized weight of adjacent node j as used by node i.
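Constructing the thresholded, row-normalized adjacency matrix can be sketched as follows (NumPy; the symmetric count matrix and sample values are illustrative):

```python
import numpy as np

def build_adjacency(counts, threshold):
    """Keep an edge only when more than `threshold` texts were exchanged,
    then normalize each row so a node's neighbor weights sum to 1."""
    A = np.where(counts > threshold, counts, 0).astype(float)
    row_sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

counts = np.array([[0, 5, 1],
                   [5, 0, 3],
                   [1, 3, 0]])      # n_ij: texts exchanged between users i and j
A = build_adjacency(counts, threshold=2)
```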
And S4, carrying out graph convolution operation based on the user association graph to obtain an association text vector.
Global graph convolution iteration is performed using the graph and each user's comprehensive text vector; the output values of the nodes in one graph convolution layer serve as input values for the next, and the iteration is performed several times. Let the association graph G = (V, E), where V and E represent the node set and edge set; the vector output h_i^(k) of node i at the k-th layer is calculated as:

h_i^(k) = ρ( (1/d_i) ∑_j A_ij h_j^(k-1) W^(k) + b^(k) )

where A_ij represents the values of the adjacency matrix, h_j^(k-1) represents the output of node j at layer k-1, W^(k) represents a weight matrix, b^(k) represents the bias vector, 1/d_i represents the normalization over node i, and ρ(·) represents the ReLU activation function. A node at layer k receives not only the layer k-1 outputs but also, implicitly, the outputs of all preceding layers; to summarize the output vector of each layer of nodes in the receptive field, a self-attention operation is performed on the output of the last layer. The main formula is:
h_i = g([v_i; h_i^(1); ...; h_i^(k)]; W_i, B_i)

where h_i is the user association vector derived for the i-th node, h_i^(k) represents the output information of other layers carried by node i in the k-th layer of graph convolution, v_i is the initial vector of the i-th node (the personal comprehensive vector v obtained in the previous step), g(·) represents the self-attention mechanism function, and W_i, B_i are the weight matrix and bias term of the self-attention mechanism function at node i.
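One graph convolution iteration of the form described above can be sketched as follows (NumPy; the random weights are placeholders for trained parameters, and the toy adjacency matrix is illustrative):

```python
import numpy as np

def gcn_layer(A, H, W, b):
    """h^(k) = relu( (1/d) * A @ h^(k-1) @ W + b ), with d_i the degree of node i."""
    d = np.maximum(A.sum(axis=1, keepdims=True), 1)  # node degrees, avoiding /0
    return np.maximum((A @ H) / d @ W + b, 0)        # ReLU activation

rng = np.random.default_rng(1)
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])                         # toy unweighted adjacency
H0 = rng.normal(size=(3, 4))                         # initial user text vectors
H1 = gcn_layer(A, H0, rng.normal(size=(4, 4)), np.zeros(4))
```

Stacking several such layers and attention-pooling their outputs per node would yield the user association vector h_i.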
S5, obtaining the classification label of the user's text content based on the associated text vector. Selecting a suitable classification function to classify the associated text vector yields the classification label.
In a preferred embodiment, the person's label may be derived from the personal text vector and the associated text vector together. Fig. 3 is a frame diagram of the model in this embodiment: the obtained associated text vector and the personal text vector are concatenated, and the class probabilities are determined with a fully connected layer and the softmax function. The specific formulas are as follows:
x = concat(h_i, v_i)

z = fc(dropout(x))

pred = softmax(z)
Finally, argmax is applied to the probability result pred, and the category corresponding to the maximum probability value is taken as the user's category label.
The loss function remains the cross-entropy loss; a dropout function and an L1 regularization term are added at the fully connected layer to prevent the model from over-fitting. The loss function is:
L = -∑ y_i log(p_i) + λ||Θ||_1
where y_i represents the true label, p_i represents the predicted probability value, and Θ represents the other parameters of the model.
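The classification head and loss can be sketched together as follows (NumPy; dropout is omitted as at inference time, and all weights are illustrative placeholders for trained parameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(h_i, v_i, W, b):
    """x = concat(h_i, v_i); pred = softmax(fc(x)); label = argmax(pred)."""
    x = np.concatenate([h_i, v_i])   # splice associated and personal vectors
    pred = softmax(W @ x + b)        # fully connected layer + softmax
    return pred, int(np.argmax(pred))

def loss(pred, y_onehot, theta, lam=1e-3):
    """Cross entropy plus an L1 term: L = -sum_i y_i log(p_i) + lam * |Theta|_1."""
    return -np.sum(y_onehot * np.log(pred + 1e-12)) + lam * np.abs(theta).sum()

rng = np.random.default_rng(2)
pred, label = classify(rng.normal(size=4), rng.normal(size=4),
                       rng.normal(size=(3, 8)), np.zeros(3))
```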
FIG. 4 is a diagram of another embodiment of a social text classification system 400 based on a graph convolutional network, comprising:
a data crawling module 401 configured to obtain social text data, the social text data including a user and text content of the user;
a text content classification module 402, configured to calculate text content of each user, and obtain a text vector of the user;
the user association diagram construction module 403 is configured to construct a user association diagram by using the text vector of the user as a node and the number of text contents sent between the users as edges;
a graph convolution module 404 configured to perform a graph convolution operation based on the user-associated graph to obtain an associated text vector;
a user portrayal module 405 configured to obtain a category label for text content of a user based on the associated text vector.
For social text classification, the invention provides a person portrait analysis method based on BERT and GCN. Taking chat text as the main analysis content, it quantifies both the social text content of the chat person and that of related contacts through sentence vector conversion, self-attention mechanisms, and other deep learning methods to obtain the person portrait. The method makes up for the shortcomings of prior-art classification models, which mine a person's chat content insufficiently and do not consider the associations between chat persons. The invention has broad application prospects in person-portrait functions of various terminal APPs.
While the present application has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application as defined by the appended claims.
Claims (10)
1. A social text classification method based on a graph convolutional network, characterized by comprising the following steps:
acquiring social text data, wherein the social text data comprises a user and text content of the user;
calculating the text content of each user to obtain a user text vector;
constructing a user association graph by taking the user text vector as a node and taking the number of text contents transmitted among users as edges;
performing graph convolution operation based on the user association graph to obtain an association text vector;
and obtaining the classification labels of the text contents of the user based on the associated text vectors.
2. The social text classification method based on a graph convolutional network as claimed in claim 1, wherein the user text vector is obtained by applying a trained BERT-attention model.
3. The social text classification method based on a graph convolutional network as claimed in claim 2, wherein the input of the BERT-attention model is a sentence set X, and the construction of the sentence set X comprises the steps of:
performing primary classification on the text content of the user;
m sentences are extracted from each category to form the sentence set X, wherein m is proportional to the ratio of the number of sentences in the category to the number of all sentences of the text content.
4. The social text classification method based on a graph convolutional network as claimed in claim 2, wherein the BERT-attention model encodes the input text content as sentence vectors and applies a self-attention mechanism to compute a weighted sum of the sentence vectors, obtaining the user text vector.
5. The social text classification method based on a graph convolutional network as claimed in claim 1, wherein edges of the user association graph are built only between users whose number of exchanged texts is greater than a threshold.
6. The social text classification method based on a graph convolutional network as claimed in claim 1, wherein the classification label of the user's text content is obtained based on the associated text vector and the user text vector.
7. The social text classification method based on a graph convolutional network as claimed in claim 6, wherein the calculation of the classification label of the user's text content specifically comprises the steps of:
splicing the associated text vector and the user text vector to obtain a spliced vector;
and classifying the spliced vectors to obtain classification labels of text contents of users.
8. A social text classification system based on a graph convolutional network, comprising:
the data crawling module is configured to acquire social text data, wherein the social text data comprises a user and text content of the user;
the text content classification module is configured to calculate the text content of each user to obtain a user text vector;
the user association diagram construction module is configured to construct a user association diagram by taking the user text vector as a node and taking the number of text contents sent among users as edges;
the graph convolution module, configured to perform a graph convolution operation based on the user association graph to obtain an associated text vector;
and the user portrait module is configured to obtain a classification label of the text content of the user based on the associated text vector.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310930896.2A CN117312551A (en) | 2023-07-27 | 2023-07-27 | Social text classification method, system, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310930896.2A CN117312551A (en) | 2023-07-27 | 2023-07-27 | Social text classification method, system, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117312551A true CN117312551A (en) | 2023-12-29 |
Family
ID=89283722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310930896.2A Pending CN117312551A (en) | 2023-07-27 | 2023-07-27 | Social text classification method, system, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117312551A (en) |
- 2023-07-27 CN CN202310930896.2A patent/CN117312551A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107203511B (en) | Network text named entity identification method based on neural network probability disambiguation | |
CN111858944B (en) | Entity aspect level emotion analysis method based on attention mechanism | |
Sivakumar et al. | Review on word2vec word embedding neural net | |
CN109284506A (en) | A kind of user comment sentiment analysis system and method based on attention convolutional neural networks | |
CN110929030A (en) | Text abstract and emotion classification combined training method | |
CN111914185B (en) | Text emotion analysis method in social network based on graph attention network | |
CN106569998A (en) | Text named entity recognition method based on Bi-LSTM, CNN and CRF | |
CN112307351A (en) | Model training and recommending method, device and equipment for user behavior | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN113704460B (en) | Text classification method and device, electronic equipment and storage medium | |
Jiang | Research on ctr prediction for contextual advertising based on deep architecture model | |
CN112749562A (en) | Named entity identification method, device, storage medium and electronic equipment | |
US20230169271A1 (en) | System and methods for neural topic modeling using topic attention networks | |
CN111695591A (en) | AI-based interview corpus classification method, device, computer equipment and medium | |
WO2022218139A1 (en) | Personalized search method and search system combined with attention mechanism | |
CN111368082A (en) | Emotion analysis method for domain adaptive word embedding based on hierarchical network | |
CN113392179A (en) | Text labeling method and device, electronic equipment and storage medium | |
CN111507093A (en) | Text attack method and device based on similar dictionary and storage medium | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN113254652A (en) | Social media posting authenticity detection method based on hypergraph attention network | |
CN113486174B (en) | Model training, reading understanding method and device, electronic equipment and storage medium | |
CN111859955A (en) | Public opinion data analysis model based on deep learning | |
CN113741759B (en) | Comment information display method and device, computer equipment and storage medium | |
CN115374283A (en) | Double-graph attention network-based aspect category emotion classification method | |
Jasim et al. | Analyzing Social Media Sentiment: Twitter as a Case Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||