CN115329085A

CN115329085A - Social robot classification method and system

Info

Publication number: CN115329085A
Application number: CN202211039150.4A
Authority: CN
Inventors: 徐雅斌; 毛文清
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2022-08-29
Filing date: 2022-08-29
Publication date: 2022-11-11

Abstract

The invention discloses a social robot classification method and system, relating to the technical field of social robot detection. The method comprises: acquiring the blog content of a target social robot about a target topic; and inputting the blog content into a social robot classification model to obtain the category of the target social robot. The social robot classification model comprises a topic relevance target model and a viewpoint sentence recognition target model. The social robot classification model is determined by: constructing a source domain data set based on transfer learning; determining a target domain data set based on a social robot recognition model; expanding and compressing the set topics in the source domain data set; determining the topic relevance target model according to the source domain data set, the target domain data set, the compressed topic expansion sequence and a twin (Siamese) network; and determining the viewpoint sentence recognition target model according to the source domain data set, the target domain data set, a rule-based viewpoint sentence recognition method and a text classification model. The invention can improve the universality and interpretability of the classification method.

Description

Social robot classification method and system

Technical Field

The invention relates to the technical field of social robot detection, in particular to a social robot classification method and system.

Background

With the rise of social networks and services such as Twitter, Weibo, WeChat and live streaming, people can exchange views on and share a wide range of topics at any time. Meanwhile, social robots have emerged with the rapid development of artificial intelligence technology.

Social robots come in many varieties, and it is hard to tell them apart from real users. It is therefore necessary to study how to detect and classify social robots. Such classification helps regulators trace sources and apply different control measures to different types of social robots. Social robots with a positive impact can be allowed to carry out business and services normally within a certain range, while social robots with a negative impact can be placed under focused management and control to limit their spread and growth. In this way, a healthier and safer network environment is created for real users, and social harmony and stability are promoted.

At present, there is little research on classifying social robots in social networks. Existing work mainly selects account features of social robots and feeds them to a classifier, and falls mainly into the following types:

First, one document classifies abnormal users into publishers of product marketing advertisements, content polluters whose published content is inconsistent with the topic tag, and malicious-language publishers who post attacks and abuse. It extracts user content, behavior, attribute and relationship features from a social network data set, and builds the classification model with the eXtreme Gradient Boosting (XGBoost) algorithm, which can make effective use of multidimensional features and remains effective when the sample set is severely imbalanced.

Second, one document divides social accounts into active-harassment spam users, excessive-following spam users, repeated-posting spam users, marketing-advertisement spam users and normal users. It builds a multi-class classifier from one-vs-rest Support Vector Machines (SVMs), and then applies fuzzy clustering to handle the samples that the one-vs-rest SVM leaves unclassified.

Third, one document proposes a classification method that considers both benign and malicious robots. Social robots are classified into broadcast robots, consumption robots and spam robots. Broadcast robots are managed by a specific organization, mainly for information dissemination. Consumption robots aggregate content from different sources and provide update services. Spam robots deliver malicious content and cover most malicious robots. The article first plots the Cumulative Distribution Function (CDF) of several key attributes to learn the activity patterns of robot and human accounts. It then proposes corresponding classification features, and finally classifies with naive Bayes, random forest, support vector machine and logistic regression models.

Fourth, one document divides social accounts into normal users, verified users, promoters and trend hijackers. Promoters include accounts that post malicious URLs. Trend hijackers include accounts that publish tweets unrelated to the topic in order to advertise a particular product or service, and accounts that publish tweets related to the topic for opinion manipulation and political advertising. The document links similar accounts according to the applications they share and builds a Markov random field model on the resulting similarity graph for classification.

The documents above extract various account features, then perform feature selection or plot CDF curves to check whether the selected features are effective, and finally perform multi-class classification with a machine learning method. However, they neither state explicit classification criteria for the different classes nor propose targeted features for distinguishing different classes of social robots, so their interpretability is poor.

Fifth, one document proposes more targeted detection features by analyzing the characteristics of the tweets published by each type of social robot. Social robots are classified as robots, cyborgs and human spammers. The vocabulary used in a robot's tweets is very limited, and the tweets follow a highly structured pattern. Cyborgs tend to copy content from other sources, and their vocabulary is much larger than that of ordinary robots. Human spammers abuse algorithms to post series of nearly indistinguishable tweets in order to fool Twitter's spam detection. Compared with methods that select common account features, this method analyzes the differences between robot types in depth, summarizes rules and extracts features, and thus advances research on social robot classification.

In summary, existing research classifies social robots by features extracted from their behavior and blog content, but the behavior and language of robot accounts may be adjusted as detection mechanisms and generation technology change. Existing classification schemes can therefore only learn the features of existing robot types; they cannot reliably identify social robots that follow a different pattern, and they cannot adapt as robots evolve over time. It is therefore crucial to design a classification method that can adapt to changing social robots.

Disclosure of Invention

In view of this, embodiments of the invention provide a social robot classification method and system that improve the universality and interpretability of the classification method.

To achieve the above purpose, the invention provides the following scheme:

a social robot classification method, comprising:

acquiring the blog content of a target social robot about a target topic;

inputting the blog content of the target social robot about the target topic into a social robot classification model to obtain the category of the target social robot; the categories include content polluters, knowledge propagators and news reviewers; the social robot classification model comprises a topic relevance target model and a viewpoint sentence recognition target model;

a content polluter is a target social robot whose published blog content is unrelated to the target topic; a news reviewer is a target social robot whose published blog content is related to the target topic and gives opinions and expresses viewpoints; a knowledge propagator is a target social robot whose published blog content is related to the target topic and spreads information and explains objective events;

the determination method of the social robot classification model comprises the following steps:

constructing a source domain data set based on a transfer learning method; the source domain data set comprises a first type of data set and a second type of data set; the first type of data set comprises original blog content crawled from a microblog platform and posted by accounts under a set topic, original blog content posted by accounts under topics related to the set topic, and the corresponding classification labels; a classification label indicates whether the account is a content polluter or a knowledge propagator; the second type of data set comprises opinion-type blog posts generated by a social robot sample data generation model and labeled as posted by news reviewer accounts;

determining a target domain data set based on a social robot recognition model; the target domain data set comprises real social robot blog content with labeled categories;

expanding the set topics in the source domain data set and compressing the topic content to obtain a topic expansion sequence;

determining the topic relevance target model according to the source domain data set, the target domain data set, the topic expansion sequence and a twin (Siamese) network; the topic relevance target model is used for identifying content polluters;

determining the viewpoint sentence recognition target model according to the source domain data set, the target domain data set, a rule-based viewpoint sentence recognition method and a text classification model; the viewpoint sentence recognition target model is used for distinguishing knowledge propagators from news reviewers.

Optionally, the determining the topic relevance target model according to the source domain data set, the target domain data set, the topic expansion sequence and the twin network specifically includes:

inputting the source domain data set and the topic expansion sequence into the twin network, preliminarily training the twin network with the objective of minimizing a mean squared error loss function, and determining a similarity threshold of the twin network; when an account in the source domain data set is a content polluter, the similarity between its original blog content and the topic expansion sequence is smaller than the similarity threshold;

determining the preliminarily trained twin network, together with the determined similarity threshold, as a topic relevance source model;

fine-tuning the similarity threshold of the topic relevance source model with the target domain data set and the corresponding target domain topic expansion sequence;

and determining the topic relevance source model with the fine-tuned similarity threshold as the topic relevance target model.
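The training objective and decision rule described above can be written compactly. The following is only a sketch under stated assumptions: the similarity measure is taken to be cosine similarity between the two encoder outputs (the claim does not name one), and the symbols $u_i$ (blog post embedding), $v_i$ (topic expansion sequence embedding), $y_i$ (relevance label) and $\tau$ (similarity threshold) are introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{MSE}}
  = \frac{1}{N}\sum_{i=1}^{N}\bigl(\cos(u_i, v_i) - y_i\bigr)^{2},
\qquad
\cos(u_i, v_i) = \frac{u_i \cdot v_i}{\lVert u_i\rVert\,\lVert v_i\rVert},
\qquad
\text{account } i \text{ is a content polluter if } \cos(u_i, v_i) < \tau .
```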

Optionally, the determining the viewpoint sentence recognition target model according to the source domain data set, the target domain data set, the rule-based viewpoint sentence recognition method, and the text classification model specifically includes:

extracting sentence features from the source domain data set; the sentence features comprise keyword features, position features, semantic features and length features;

normalizing the sentence features and computing their weighted sum to obtain the viewpoint sentence score of each sentence;

determining a viewpoint sentence threshold value of a rule-based viewpoint sentence recognition model according to the viewpoint sentence score;

training a convolutional neural network by adopting the data with the viewpoint sentence score smaller than the viewpoint sentence threshold value in the target domain data set, and determining the trained convolutional neural network as a text classification model;

forming a viewpoint sentence recognition source model from the rule-based viewpoint sentence recognition model defined by the viewpoint sentence threshold and the text classification model;

fine-tuning a viewpoint sentence threshold value and a convolutional neural network parameter in the viewpoint sentence recognition source model by adopting the target domain data set;

and determining the fine-tuned viewpoint sentence recognition source model as the viewpoint sentence recognition target model.
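The scoring step above (normalize each sentence feature, then take a weighted sum) can be sketched as follows. This is an illustrative sketch, not the patented implementation: min-max normalization, the feature names used as dictionary keys, and the equal weights in the usage note are all assumptions, since the text does not fix the normalization method or the weights.

```python
def min_max_normalize(values):
    """Scale a list of raw feature values into [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no ranking information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def viewpoint_scores(features, weights):
    """Return one viewpoint sentence score per sentence.

    features: dict mapping a feature name to a list of raw values,
              one value per sentence (hypothetical layout).
    weights:  dict mapping the same feature names to their weights.
    """
    names = list(weights)
    normalized = {name: min_max_normalize(features[name]) for name in names}
    sentence_count = len(features[names[0]])
    return [
        sum(weights[name] * normalized[name][i] for name in names)
        for i in range(sentence_count)
    ]
```

For example, with the four features `keyword`, `position`, `semantic` and `length` weighted 0.25 each, a sentence that has the highest raw value for three of the four features across the batch receives a score of 0.75 or more. Sentences are then compared against the viewpoint sentence threshold.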

Optionally, the expanding the set topics in the source domain data set and compressing the topic content to obtain a topic expansion sequence specifically includes:

crawling the lead text of topics related to the set topic, and generating an extended document from the lead text of all related topics;

and extracting keywords from the extended document with a graph-based text ranking algorithm to obtain the topic expansion sequence.

Optionally, the determining a target domain data set based on the social robot recognition model specifically includes:

detecting accounts with a social robot recognition model to obtain real data whose type is social robot;

manually labeling the real data and removing duplicate blog posts to obtain valid social robot data;

determining the valid social robot data as a target domain data set.

Optionally, the twin network is built on a pre-trained Transformer-based bidirectional encoder.

Optionally, the inputting the blog content of the target social robot about the target topic into a social robot classification model to obtain the category of the target social robot specifically includes:

inputting the blog content of the target social robot about the target topic into the topic relevance target model, and identifying whether the target social robot is a content polluter;

and if the target social robot is not a content polluter, inputting the blog content of the target social robot about the target topic into the viewpoint sentence recognition target model, and identifying whether the target social robot is a knowledge propagator or a news reviewer.
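The two-stage decision described above can be sketched as a small cascade. This is a minimal sketch under assumptions: `is_relevant` and `is_opinion` are hypothetical callables standing in for the topic relevance target model and the viewpoint sentence recognition target model, and the mapping of opinionated content to news reviewers follows the interpretation of terms given in the description.

```python
CONTENT_POLLUTER = "content polluter"
NEWS_REVIEWER = "news reviewer"
KNOWLEDGE_PROPAGATOR = "knowledge propagator"


def classify_social_robot(posts, topic, is_relevant, is_opinion):
    """Classify one social robot from its blog posts about a topic.

    is_relevant(posts, topic) -> bool  # stand-in for the topic relevance model
    is_opinion(posts) -> bool          # stand-in for the viewpoint sentence model
    """
    # Stage 1: posts unrelated to the topic mark a content polluter.
    if not is_relevant(posts, topic):
        return CONTENT_POLLUTER
    # Stage 2: among on-topic robots, opinionated content marks a news
    # reviewer; objective, informational content marks a knowledge propagator.
    if is_opinion(posts):
        return NEWS_REVIEWER
    return KNOWLEDGE_PROPAGATOR
```

Running the relevance check first means the second model only ever sees on-topic content, which matches the order of the two steps above.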

The invention also provides a social robot classification system for implementing the above method, comprising:

a blog content acquisition module, configured to acquire the blog content of a target social robot about a target topic;

a classification recognition module, configured to input the blog content of the target social robot about the target topic into a social robot classification model to obtain the category of the target social robot; the categories include content polluters, knowledge propagators and news reviewers; the social robot classification model comprises a topic relevance target model and a viewpoint sentence recognition target model;

a content polluter is a target social robot whose published blog content is unrelated to the target topic; a news reviewer is a target social robot whose published blog content is related to the target topic and gives opinions and expresses viewpoints; a knowledge propagator is a target social robot whose published blog content is related to the target topic and spreads information and explains objective events;

a classification model determination module for determining the social robot classification model;

the classification model determining module specifically includes:

a source domain data set construction unit, configured to construct a source domain data set based on a transfer learning method; the source domain data set comprises a first type of data set and a second type of data set; the first type of data set comprises original blog content crawled from a microblog platform and posted by accounts under a set topic, original blog content posted by accounts under topics related to the set topic, and the corresponding classification labels; a classification label indicates whether the account is a content polluter or a knowledge propagator; the second type of data set comprises opinion-type blog posts generated by a social robot sample data generation model and labeled as posted by news reviewer accounts;

a target domain data set construction unit, configured to determine a target domain data set based on a social robot recognition model; the target domain data set comprises real social robot blog content with labeled categories;

a topic expansion and compression module, configured to expand the set topics in the source domain data set and compress the topic content to obtain a topic expansion sequence;

a topic relevance target model determining module, configured to determine the topic relevance target model according to the source domain data set, the target domain data set, the topic expansion sequence and a twin (Siamese) network; the topic relevance target model is used for identifying content polluters;

a viewpoint sentence recognition target model determining module, configured to determine the viewpoint sentence recognition target model according to the source domain data set, the target domain data set, a rule-based viewpoint sentence recognition method and a text classification model; the viewpoint sentence recognition target model is used for distinguishing knowledge propagators from news reviewers.

Compared with the prior art, the invention has the beneficial effects that:

Embodiments of the invention provide a social robot classification method and system. The method comprises: acquiring the blog content of a target social robot about a target topic; and inputting the blog content into a social robot classification model to obtain the category of the target social robot. The social robot classification model comprises a topic relevance target model and a viewpoint sentence recognition target model. The social robot classification model is determined by: constructing a source domain data set based on transfer learning; determining a target domain data set based on a social robot recognition model; expanding and compressing the set topics in the source domain data set; determining the topic relevance target model according to the source domain data set, the target domain data set, the compressed topic expansion sequence and a twin (Siamese) network; and determining the viewpoint sentence recognition target model according to the source domain data set, the target domain data set, a rule-based viewpoint sentence recognition method and a text classification model. The method classifies social robots based on transfer learning by text mining the blog content they publish, and improves the universality and interpretability of the classification method.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a social robot classification method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for determining a classification model of a social robot according to an embodiment of the present invention;

FIG. 3 is an overall process diagram of a social robot classification method according to an embodiment of the present invention;

FIG. 4 is a diagram of a topic relevance model framework provided by an embodiment of the invention;

FIG. 5 is a diagram of the SBERT model architecture provided by the embodiment of the present invention;

FIG. 6 is a flow chart of knowledge propagator and news reviewer identification provided by an embodiment of the present invention;

FIG. 7 is a framework diagram of a network-based deep transfer learning model according to an embodiment of the present invention;

FIG. 8 is a structural diagram of a social robot classification system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Interpretation of terms:

Social robot: a virtual robot active in social networks. It is in fact an automated program that uses social accounts and related technologies such as artificial intelligence to simulate human behavior in a social network.

Content polluter: a social robot account that publishes blog posts unrelated to the topic.

News reviewer: a social robot account whose published content is related to the topic and which gives opinions and expresses viewpoints.

Knowledge propagator: a social robot account whose published content is related to the topic and which spreads information and explains the objective facts of an event.

SBERT (Sentence Embeddings using Siamese BERT-Networks): a twin (Siamese) network model based on pre-trained BERT. Both sub-networks of the SBERT model are BERT models, and the two BERT models share parameters. To compare the similarity of sentences A and B, each sentence is fed into the BERT network, the outputs are two vectors that represent the sentences, and the similarity of the two sentences is then computed from these vectors.
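The shared-parameter idea can be illustrated with a toy example. This is a sketch only: `toy_encode` is a hypothetical bag-of-words stand-in for the BERT encoder (it is not BERT), `VOCAB` is an invented vocabulary, and cosine similarity is assumed as the similarity measure. The point is that both sentences pass through the same encoder, which is what parameter sharing means in a Siamese network.

```python
import math

# Hypothetical toy vocabulary; a real system would use BERT embeddings.
VOCAB = ["flood", "rescue", "sale", "discount"]


def toy_encode(sentence):
    """Stand-in encoder: count how often each vocabulary word occurs."""
    words = sentence.lower().split()
    return [float(words.count(word)) for word in VOCAB]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def siamese_similarity(encode, sentence_a, sentence_b):
    """Encode both sentences with the SAME encoder, then compare them."""
    return cosine_similarity(encode(sentence_a), encode(sentence_b))
```

With this toy encoder, `siamese_similarity(toy_encode, "flood rescue", "sale discount")` is 0.0 because the two sentences share no vocabulary words, which mirrors how an off-topic post would fall below a similarity threshold.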

TextCNN: a text classification model that uses a Convolutional Neural Network (CNN) to solve text classification problems.
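As a rough illustration of the feature extractor at the core of a TextCNN, the sketch below slides one convolution kernel over consecutive word vectors, applies ReLU, and keeps the maximum activation (max-over-time pooling). It is a simplified, assumption-laden sketch: a real TextCNN uses many kernels of several widths followed by a trained output layer, and the function name and argument layout here are hypothetical.

```python
def conv_max_feature(embeddings, kernel, bias=0.0):
    """Compute one TextCNN-style feature for a sentence.

    embeddings: list of word vectors, one per token.
    kernel:     list of h weight vectors; the kernel spans h consecutive words.
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        total = bias
        for offset in range(width):
            word_vec = embeddings[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vec))
        activations.append(max(0.0, total))  # ReLU non-linearity
    # Max-over-time pooling keeps the strongest response in the sentence.
    return max(activations) if activations else 0.0
```

Because pooling keeps only the strongest window, the feature is independent of sentence length, which is why the resulting feature vector can feed a fixed-size classification layer.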

The TextRank algorithm: a graph-based ranking algorithm for text. The basic idea is to treat a document as a network of words, in which links represent semantic relationships between words. TextRank is mainly used for extracting keywords, key phrases and key sentences.

BERT (Bidirectional Encoder Representations from Transformers): a Transformer-based bidirectional encoder representation, which is a pre-trained language representation model.

Pooling operation: pooling is a very common operation in CNNs, also called subsampling or downsampling. When a convolutional neural network is built, pooling is often applied after a convolutional layer to reduce the feature dimension of the convolutional layer's output, which effectively reduces the number of network parameters and helps prevent over-fitting.

The MEAN strategy: mean pooling, one of the pooling strategies, which averages the values being pooled.
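A minimal sketch of mean pooling over token vectors, as it might be used to turn per-token encoder outputs into one sentence vector. The function name and input layout are assumptions for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("mean_pool needs at least one token vector")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]
```

For example, `mean_pool([[1.0, 2.0], [3.0, 4.0]])` averages each dimension across the two tokens and returns `[2.0, 3.0]`.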

Word2Vec word vector model: Word2Vec is a neural-network-based language model and also a word representation method. Word2Vec includes two architectures, skip-gram and CBOW (continuous bag of words); both essentially perform dimensionality reduction on the vocabulary.
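The difference between the two architectures shows up in how training examples are built from a token sequence: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. The sketch below only generates the training examples (it does not train embeddings); the function names and the `window` parameter are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=1):
    """Return (context_words, center) examples: the context predicts the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples
```

For the tokens `["a", "b", "c"]` with `window=1`, skip-gram yields four pairs, while CBOW yields three examples, one per center word.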

Each social robot has a specific purpose that does not change, and all of its actions serve that final goal. Studying the purpose of a social robot directly is therefore more effective than analyzing its behavior to extract surface features. To achieve their own purposes, different types of social robots take different actions in response to the same event, publish different statements, and contribute differently to the way the event develops.

Therefore, to overcome the defects of the prior art, the invention provides a social robot classification method based on transfer learning. The method performs text mining on the blog posts published by social robots and judges, from the text mining results, the role each social robot plays in a specific event, thereby completing the classification of social robots.

To address the problem that the classification methods in the existing literature cannot adapt to the task of classifying new kinds of social robots, the invention provides a more general and interpretable social robot classification method that considers the purpose of the social robot. In addition, existing research on social robot classification generally suffers from a severe shortage of samples, so the invention also studies how to design a social robot classification method that performs well with limited sample data.

In the task of judging whether a blog post is relevant to a specific topic on the Sina Weibo platform, the topic text is semantically sparse, and the topic text and the microblog text differ greatly in length. To address these problems, the invention provides a more effective relevance judgment method built on an existing text relevance judgment model, which enriches the semantic information of the topic text and identifies content polluters more accurately.

Viewpoint sentences generated according to the social robot viewpoint sentence generation principle lack human writing skill, do not show the usual features of viewpoint sentences, and are difficult to detect with existing viewpoint sentence recognition methods. To address these problems, the invention sets appropriate viewpoint sentence recognition rules and builds a viewpoint sentence recognition model that works from the keywords of the key corpus. This compensates for the weakness of existing methods, which struggle to recognize generated viewpoint sentences that are not fluent and carry incomplete information, and thus improves the accuracy of viewpoint sentence recognition.

Fig. 1 is a flowchart of a social robot classification method according to an embodiment of the present invention. Referring to fig. 1, the social robot classification method of the embodiment includes:

Step 101: acquiring the blog content of the target social robot about the target topic.

Step 102: inputting the blog content of the target social robot about the target topic into a social robot classification model to obtain the category of the target social robot; the categories include content polluters, knowledge propagators and news reviewers; the social robot classification model comprises a topic relevance target model and a viewpoint sentence recognition target model.

A content polluter is a target social robot whose published blog content is unrelated to the target topic; a news reviewer is a target social robot whose published blog content is related to the target topic and gives opinions and expresses viewpoints; a knowledge propagator is a target social robot whose published blog content is related to the target topic and spreads information and explains objective events.

Step 102 specifically includes:

Inputting the blog content of the target social robot about the target topic into the topic relevance target model, and identifying whether the target social robot is a content polluter.

Fig. 2 is a flowchart of a method for determining a social robot classification model according to an embodiment of the present invention. Referring to fig. 2, the method for determining the social robot classification model includes:

Step 201: constructing a source domain data set based on a transfer learning method; the source domain data set comprises a first type of data set and a second type of data set. The first type of data set comprises original blog content crawled from a microblog platform and posted by accounts under a set topic, original blog content posted by accounts under topics related to the set topic, and the corresponding classification labels; a classification label indicates whether the account is a content polluter or a knowledge propagator; the second type of data set comprises opinion-type blog posts generated by a social robot sample data generation model and labeled as posted by news reviewer accounts.

Step 202: determining a target domain dataset based on a social robot recognition model; the target domain data set comprises social robot real blog content with labeled categories.

Step 202 specifically includes:

Detecting accounts with a social robot recognition model to obtain real data whose type is social robot.

Manually labeling the real data and removing duplicate blog posts to obtain valid social robot data.

Determining the valid social robot data as the target domain data set.
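The duplicate-removal part of this step might look like the sketch below. It is an assumption-heavy illustration: the record layout `(account_id, text, label)` is hypothetical, and duplicates are detected by exact match after whitespace and case normalization, whereas the text does not say how duplicates are defined.

```python
def deduplicate_posts(records):
    """Keep the first occurrence of each blog post text.

    records: iterable of (account_id, text, label) tuples (hypothetical layout).
    """
    seen = set()
    unique = []
    for account_id, text, label in records:
        # Normalize whitespace and case so trivial variants count as duplicates.
        key = " ".join(text.split()).lower()
        if key in seen:
            continue
        seen.add(key)
        unique.append((account_id, text, label))
    return unique
```

Removing repeated posts before training keeps a single heavily reposted message from dominating the small target domain data set.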

Step 203: expanding the set topics in the source domain data set and compressing the topic content to obtain a topic expansion sequence.

Step 203 specifically includes:

Crawling the lead text of topics related to the set topic, and generating an extended document from the lead text of all related topics.

Extracting keywords from the extended document with a graph-based text ranking algorithm (TextRank) to obtain the topic expansion sequence.
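The keyword extraction in Step 203 can be illustrated with a compact TextRank-style sketch. It assumes the extended document has already been tokenized and filtered (for Chinese text that would require a word segmenter, which is not shown), and the window size, damping factor and iteration count are illustrative defaults rather than values from the patent.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Build an undirected graph: words are linked if they co-occur
    # within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iteratively propagate scores along the edges.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1.0 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

The top-ranked words would then form the topic expansion sequence that is compared against each blog post.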

Step 204: determining the topic relevance target model according to the source domain data set, the target domain data set, the topic expansion sequence and the twin network; the topic relevance target model is used to identify content polluters.

Step 204 specifically includes:

Constructing a twin network; the twin network is built on a pre-trained Transformer-based bidirectional encoder (SBERT).

Inputting the source domain data set and the topic expansion sequence into the twin network, preliminarily training the twin network with the objective of minimizing a mean squared error loss function, and determining a similarity threshold of the twin network; when an account in the source domain data set is a content polluter, the similarity between its original blog content and the topic expansion sequence is smaller than the similarity threshold.

Determining the preliminarily trained twin network, together with the determined similarity threshold, as the topic relevance source model.

Fine-tuning the similarity threshold of the topic relevance source model with the target domain data set and the corresponding target domain topic expansion sequence.
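One simple way to choose, and later re-tune, a similarity threshold from labeled data is a search over candidate cut points, sketched below. This is an assumption: the text does not describe how the threshold is determined or fine-tuned, and the function name, the accuracy criterion and the candidate grid are all illustrative.

```python
def best_similarity_threshold(similarities, is_polluter):
    """Pick the threshold that best separates polluters from on-topic accounts.

    similarities: post-to-topic similarity per account.
    is_polluter:  True where the labeled account is a content polluter.
    An account is predicted to be a polluter when similarity < threshold.
    """
    if not similarities:
        raise ValueError("need at least one labeled example")
    values = sorted(set(similarities))
    # Candidate cuts: just below the minimum, midpoints, just above the maximum.
    candidates = [values[0] - 1e-9]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1e-9)

    def accuracy(threshold):
        hits = sum((s < threshold) == y for s, y in zip(similarities, is_polluter))
        return hits / len(similarities)

    return max(candidates, key=accuracy)
```

Under this reading, fine-tuning on the target domain amounts to calling the same function again with similarities computed on the labeled social robot data.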

Step 205: determining the viewpoint sentence recognition target model according to the source domain data set, the target domain data set, a rule-based viewpoint sentence recognition method and a text classification model (TextCNN); the viewpoint sentence recognition target model is used to identify knowledge propagators and news reviewers.

Step 205, specifically including:

extracting sentence features of the source domain data set; the sentence features include keyword features, position features, semantic features, and length features.

Normalizing the sentence features and computing their weighted sum to obtain the viewpoint sentence score of each sentence.

Determining a viewpoint sentence threshold of the rule-based viewpoint sentence recognition model based on the viewpoint sentence scores.

Training a convolutional neural network with the data in the target domain data set whose viewpoint sentence score is smaller than the viewpoint sentence threshold, and determining the trained convolutional neural network as the text classification model.

The rule-based viewpoint sentence recognition model determined by the viewpoint sentence threshold and the text classification model together constitute a viewpoint sentence recognition source model.

Fine-tuning the viewpoint sentence threshold and the convolutional neural network parameters in the viewpoint sentence recognition source model by using the target domain data set.

Determining the fine-tuned viewpoint sentence recognition source model as the viewpoint sentence recognition target model.

In practical applications, a more specific implementation process of the social robot classification method is as follows:

This specific example provides a social robot classification method that builds on earlier-stage social robot identification. First, topics are expanded using the guide words of related topics, and on this basis an SBERT (Sentence-BERT) model is applied to judge the relevance between the blog text and the expanded topic, so as to identify content polluters. Then, a viewpoint sentence recognition method that combines the social robot viewpoint sentence generation rules with the deep learning model TextCNN is proposed to further distinguish news reviewers from knowledge propagators. Finally, to improve the classification effect of the model, a transfer learning method is used, and the model is trained with the help of a large amount of blog data from ordinary microblog accounts. The comparison experiments show that expanding the topics effectively improves the SBERT model's classification of blog-topic relevance; that analyzing the rules by which social robots generate opinion blogs and focusing attention on keywords that express opinions alleviates the difficulty of recognizing the low-quality viewpoint sentences that social robots generate; and that introducing transfer learning addresses the insufficient volume of social robot data and greatly improves the classification effect. The overall process of the social robot classification method is shown in fig. 3.

The social robot classification method in this example includes the following specific implementation steps:

Step 1, constructing the data sets.

The source domain data set consists of two parts. For the first part, data are collected by writing crawler code: the original blog content issued by all accounts under "# XXXX #" and related topics on the microblog platform is crawled, and 400 blogs whose content is irrelevant to the topics and 400 news blogs are labeled manually and used as the source data of content polluters and knowledge propagators, respectively. For the second part, 1200 viewpoint-type blogs are generated by a social robot sample data generation model and used as the source data of news reviewers. This completes the construction of the source domain data set for transfer learning.

The target domain data set is built from 188 real records of the social robot type obtained through checking by the social robot recognition model; after manual labeling and blog text de-duplication, 139 valid social robot records remain: 59 content polluters, 63 news reviewers and 17 knowledge propagators.

Step 2, identifying content polluters. An SBERT model is used to judge the relevance between the expanded topic and the blog text, and an account that issues blogs irrelevant to the topic is marked as a content polluter, which completes the identification of content polluters. As shown in FIG. 4, the overall process is as follows:

Step 21, the topic expansion module comprises two parts: topic expansion and text compression. First, content related to the topic is collected to expand it, so that the semantic information of the topic is more complete. Then, the expanded topic is compressed and the important key phrases are extracted, which avoids the data sparseness problem. Finally, the SBERT model is used to calculate the relevance between the expanded topic and the blog.

Step 22, in the topic expansion part, the invention uses the guide words of the topics related to the topic to expand its content. The hot-topic "XXXX" event is taken as an example. First, topics containing "XXXX" are searched in the microblog, the guide content of all # XX # topics on the first page is crawled, and the guide words of these topics are organized together as one document. Because these guide words are introductions to each topic and have been reviewed by the microblog platform, they are sentences related to the XXXX topic.

Step 23, in the text compression part, the invention uses the TextRank algorithm to extract keywords from the expanded topic content. The basic idea of the TextRank algorithm is derived from Google's PageRank algorithm: a graph model is built by dividing a microblog text into several constituent units (words or sentences), the important components of the microblog text are ranked through a voting mechanism, and keyword extraction can be achieved using only the information of a single microblog text.

The process of extracting keywords from the guide words with the TextRank algorithm is as follows:

1) The topic content expanded by the guide words is segmented into complete sentences.

2) For each sentence, jieba word segmentation, part-of-speech tagging, stop word removal and similar operations are carried out in sequence.

3) A topic candidate keyword graph G = (V, E) is constructed, where V is the node set consisting of the topic candidate keywords generated in step 2). Edges between nodes are then constructed from the co-occurrence relation: an edge exists between two nodes only when the corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur. The design principle of the TextRank algorithm is to link adjacent words and to compute the score of a word from the scores of its adjacent words; the edges are constructed to capture this adjacency relation between words. In the subsequent steps, each word updates its score according to the scores of the other words with which it shares an edge.

4) The score of each node is propagated iteratively according to the TextRank formula until convergence:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} WS(V_j) \qquad (1)$$

where $V_i$ and $V_j$ represent different nodes; $WS(V_i)$ represents the importance of node $V_i$; $d$ is the damping coefficient, which ranges from 0 to 1, represents the probability that a point in the graph points to any other point, and is generally set to 0.85; $In(V_i)$ is the set of points that point to $V_i$ (the in-link set); $Out(V_j)$ is the set of points that $V_j$ points to (the out-link set); and $|Out(V_j)|$ is the number of out-links. Each word distributes its own score equally among its out-links, so $\frac{1}{|Out(V_j)|} WS(V_j)$ is the score that $V_j$ contributes to $V_i$. Adding up the scores contributed to $V_i$ by all of its in-links gives the score of $V_i$ itself.

5) The node scores obtained in step 4) are sorted, and the k most important words are taken as the keywords of the guide words, which completes the construction of the topic expansion sequence.
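Steps 3) to 5) can be illustrated with a minimal sketch. It assumes the text has already been segmented and filtered into a flat list of candidate words, treats the co-occurrence graph as undirected and unweighted, and uses hypothetical names (`textrank_keywords`, `window`, `top_k`) that do not come from the patent.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=5, d=0.85, top_k=3, max_iter=100, tol=1e-6):
    """Rank candidate keywords on an undirected co-occurrence graph."""
    # Link two words when they co-occur inside a window of `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return [], {}

    scores = {word: 1.0 for word in neighbors}
    for _ in range(max_iter):
        new_scores = {}
        for word in scores:
            # Each neighbor shares its score equally among its own links.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - d) + d * incoming
        converged = max(abs(new_scores[w] - scores[w]) for w in scores) < tol
        scores = new_scores
        if converged:
            break

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k], scores
```

With `tokens = ["a", "b", "a", "c", "a", "d"]` and `window=2`, the word "a" is linked to every other word and is therefore ranked first.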

Step 24, adopting an SBERT model to judge topic relevance, as shown in FIG. 5, the process is as follows:

1) The blog text and the expanded topic are input, and after BERT encoding, two feature vectors u and v representing the sentences are obtained.

2) A pooling operation with the MEAN strategy is performed: all output vectors of the last layer for the sequence are taken, and their average is used as the sentence vector.

3) The mean square error function is used as the optimization objective, and the cosine similarity of the sentence vectors u and v is calculated to measure the similarity between the topic and the blog. When the similarity is greater than the similarity threshold, the topic is judged to be related to the blog. Since different thresholds produce different results, the threshold with the highest accuracy is taken as the final threshold.
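The pooling, similarity and threshold selection in steps 2) and 3) can be sketched as follows. This is only an illustration under stated assumptions: the token vectors are taken as given (the BERT encoder is not reproduced), and the helper names and the list of candidate thresholds are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors of one sequence (MEAN pooling strategy)."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine similarity of two sentence vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_threshold(similarities, labels, candidates):
    """Return the candidate threshold with the highest accuracy.

    labels[i] is True when blog i is related to the topic; a blog is
    predicted as related when its similarity exceeds the threshold.
    """
    def accuracy(threshold):
        hits = sum((s > threshold) == y for s, y in zip(similarities, labels))
        return hits / len(labels)

    return max(candidates, key=accuracy)
```

For example, with similarities `[0.9, 0.8, 0.3, 0.2]`, labels `[True, True, False, False]` and candidates `[0.1, 0.5, 0.85]`, only the threshold 0.5 separates the two groups perfectly, so it is selected.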

Step 3, identifying news reviewers and knowledge propagators. A first round of viewpoint sentence recognition is performed on the viewpoint sentence features extracted from the blog, and further viewpoint sentence recognition is completed with a TextCNN model. News reviewers and knowledge propagators are classified according to whether the blog is related to the topic and whether it expresses a viewpoint. As shown in fig. 6, the overall flow is as follows:

Step 31, viewpoint sentence recognition based on the generation rules. From the perspective of social robot text generation, in order for news reviewers to express a particular viewpoint, their producers typically use a corpus containing particular keywords. Specifically, a new viewpoint sentence can be generated by inputting example viewpoint sentences from a corpus for a specific event and then applying synonym substitution. Alternatively, sentence pattern features can be extracted from example viewpoint sentences of a general corpus, and a viewpoint sentence about a specific event can then be generated according to a given keyword and a given emotion. Therefore, according to the social robot viewpoint sentence generation rules, viewpoint sentences are recognized from four angles, namely keyword features, position features, semantic features and length features, and the process is as follows:

1. Each social robot blog is divided into n sentences, $d = \{s_1, s_2, \ldots, s_n\}$. Each sentence is then divided into l words, $s_i = \{w_{i1}, w_{i2}, \ldots, w_{il}\}$. By performing a weighted summation of the 4 features of each sentence, the viewpoint sentence score of the sentence is obtained, as shown in equation (2), where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are the weights of the 4 features and sum to 1. The values of $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ can be adjusted as appropriate.

$$f(s_i) = \lambda_1 f_{keyword}(s_i) + \lambda_2 f_{position}(s_i) + \lambda_3 f_{semantics}(s_i) + \lambda_4 f_{length}(s_i) \qquad (2)$$

2. The keyword features, position features, semantic features and length features are extracted.

Keyword features: the TextRank algorithm is used to extract keywords whose parts of speech are nouns and verbs from all the blogs published by the social robots. When a key sentence is selected, a sentence in which a keyword appears is very likely to be a viewpoint sentence. Thus, the score function of the keyword features is as follows:

Position features: in a blog that expresses a viewpoint, the central viewpoint is generally at the beginning or the end. Therefore, the first or last sentence of the blog is more likely to be a viewpoint sentence, and the score function of the position feature is as follows:

where n is the number of sentences in the text, i is the position of the sentence in the text, and a, b and c are coefficients. As the function expression shows, sentences in the middle of the text score low, and sentences at the beginning and the end score higher.

Semantic features: words that express a viewpoint and words with strong subjective, summarizing or contrastive meaning often appear in viewpoint sentences. Typical words that express a viewpoint include "support, objection, resistance, consent, trust, strength", and the like. Words that indicate subjectivity include "I, think, estimate, should, perhaps, approximate", and the like. Words that indicate a summary include "thus, in summary, so, in summary, described above, visible from this", and the like. Words that indicate a contrast include "however, although, albeit, nevertheless", and the like. Thus, the score function of the semantic features is as follows:

Length features: news blogs published by knowledge propagators are mostly objective descriptions of the facts of an event and are mostly long, whereas the viewpoint sentences that news reviewers generate according to rules are relatively short. Thus, the invention takes the length feature as one feature for viewpoint sentence recognition, and the score function of the length feature is as follows:

3. The viewpoint sentence score $f(s_i)$ of each sentence of a social robot blog is obtained by normalizing the four features of the sentence and computing their weighted sum. A viewpoint threshold $\theta \in (0,1)$ is set; when the sentence score $f(s_i)$ is greater than $\theta$, the sentence is judged to be a viewpoint sentence. If several sentences in one blog score above $\theta$, the sentence with the highest score is taken as the viewpoint sentence, which completes the rule-based viewpoint sentence recognition. An account whose blog contains a viewpoint sentence is judged to be a news reviewer.
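The weighted scoring of equation (2) and the threshold rule can be sketched as below. The four feature values are assumed to be already computed and normalized to [0, 1] by feature functions that are not reproduced here; the dictionary keys, function names and example weights are illustrative assumptions only.

```python
def viewpoint_score(features, weights):
    """Weighted sum of the four normalized feature values, as in equation (2)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("the four weights must sum to 1")
    return sum(weights[name] * features[name] for name in weights)


def pick_viewpoint_sentence(sentence_features, weights, theta):
    """Return the index of the highest-scoring sentence above theta, else None."""
    if not sentence_features:
        return None
    scores = [viewpoint_score(f, weights) for f in sentence_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > theta else None


# Hypothetical equal weights and feature values for a two-sentence blog.
weights = {"keyword": 0.25, "position": 0.25, "semantics": 0.25, "length": 0.25}
blog = [
    {"keyword": 1.0, "position": 1.0, "semantics": 1.0, "length": 0.0},
    {"keyword": 0.0, "position": 0.0, "semantics": 0.0, "length": 1.0},
]
```

A blog for which `pick_viewpoint_sentence` returns `None` is one judged by the rules not to contain a viewpoint sentence, which is the case that is passed on to the TextCNN model.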

Step 32, viewpoint sentence recognition based on TextCNN. In order to recognize the viewpoint sentence more comprehensively, the invention inputs the blog text judged not to include the viewpoint sentence into the TextCNN model to perform further viewpoint sentence judgment. The specific identification process is as follows:

1) The blog is represented with a Word2Vec word vector model, and the resulting word vectors are used as the input of the embedding layer.

2) Features are extracted from the vectorized representation of the blog through different filters.

3) The most significant features are selected through a pooling operation.

4) Viewpoint sentence recognition is converted into a classification problem through a fully connected layer, which completes the viewpoint sentence recognition.

The four recognition steps correspond to the embedding layer, the convolution layer, the pooling layer, and the fully connected layer with softmax output of the viewpoint sentence recognition model (TextCNN model), which are the four levels of the model structure.
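To make the four levels concrete, the following is a minimal forward-pass sketch in plain Python. It is not the trained model of the embodiment: the embedding table stands in for Word2Vec vectors, the filter and dense weights are set by hand, and all names are hypothetical.

```python
import math


def conv_max_pool(embedded, kernel, bias=0.0):
    """Slide one filter over the token sequence, apply ReLU, keep the maximum."""
    width = len(kernel)
    activations = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start : start + width]
        total = bias + sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        activations.append(max(0.0, total))
    return max(activations) if activations else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def textcnn_forward(tokens, embeddings, kernels, dense_weights, dense_bias):
    """Embedding -> convolution -> max pooling -> fully connected -> softmax."""
    embedded = [embeddings[t] for t in tokens]               # embedding layer
    pooled = [conv_max_pool(embedded, k) for k in kernels]   # convolution + pooling
    logits = [
        sum(w * p for w, p in zip(row, pooled)) + b          # fully connected layer
        for row, b in zip(dense_weights, dense_bias)
    ]
    return softmax(logits)                                   # class probabilities


# Hand-set toy parameters: 2-dimensional embeddings, filters of width 1 and 2.
embeddings = {"i": [1.0, 0.0], "agree": [0.0, 1.0]}
kernels = [[[0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
probs = textcnn_forward(["i", "agree"], embeddings, kernels,
                        [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In practice the filters and the output layer would be learned, typically with a deep learning framework; the sketch only shows how data flows through the layers.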

Step 4, classifying the social robots based on transfer learning. Deep learning models need large amounts of data, whereas the amount of available social robot data is limited. Therefore, the social robot classification model is trained by transferring from human microblog data, which realizes the classification of social robots. The overall process is as follows:

The deep neural network is first pre-trained with a large amount of human data from the source domain, and then the network structure and network parameters are migrated to the target domain. The target data are scarce, and social robot blog text is generated by imitating human blog text and is therefore highly similar to it, so the two are close in both shallow and deep network features. Therefore, the parameters of all layers except the output layer are frozen, the output layer is fine-tuned with the social robot training data, and finally the classification effect of the model is tested with the test data. As shown in fig. 7, the specific process is as follows:

Step 41, the source domain data set is randomly divided into a source domain training set and a source domain test set at a ratio of 8:2, and the target domain data set is divided in the same way into a target domain training set and a target domain test set;

Step 42, the topic relevance model and the viewpoint sentence recognition model are trained and tested with the source domain data set, and the model parameters with the best test effect are saved to obtain the topic relevance source model and the viewpoint sentence recognition source model;

Step 43, the topic relevance source model and the viewpoint sentence recognition source model are fine-tuned with the target domain data set to obtain the topic relevance target model and the viewpoint sentence recognition target model;

Step 44, the topic relevance target model and the viewpoint sentence recognition target model are combined to obtain the multi-classification model of the social robot, and social robots are classified as content polluters, news reviewers or knowledge propagators according to the 4 combined results of the two binary classification models, namely related viewpoint sentence, related non-viewpoint sentence, unrelated viewpoint sentence and unrelated non-viewpoint sentence.
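Steps 41 and 44 can be sketched as follows. The mapping from the four combined results to the three categories follows the description above (unrelated blogs indicate a content polluter; among related blogs, a viewpoint sentence indicates a news reviewer); the function names, the fixed random seed and the string labels are assumptions made for illustration.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split samples into a training set and a test set (8:2 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def classify_robot(is_related, has_viewpoint):
    """Combine the outputs of the two binary models into one category."""
    if not is_related:
        # Both "unrelated" combinations map to the same category.
        return "content polluter"
    return "news reviewer" if has_viewpoint else "knowledge propagator"


# Splitting a data set of 139 records, the size of the target domain data set.
train, test = split_dataset(range(139))
```

Applied to 139 records, the split yields 111 training records and 28 test records.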

The social robot classification method of the embodiment has the following advantages:

(1) Topic expansion is proposed to address the data sparseness caused by the large length difference between microblog text and topics, and a topic-blog relevance recognition model based on a topic expansion module is established. First, topics are expanded with the guide words of related topics; then, keywords are extracted with TextRank to obtain the topic expansion sequence, so that the topic expresses the event content more fully and effectively; finally, the SBERT model is used to calculate the text relevance between the topic expansion sequence and the blog sequence, which realizes the identification of content polluters. The comparison experiments show that the topic expansion module expresses the event content more fully and effectively and improves the judgment of topic-blog relevance.

(2) Viewpoint sentence recognition is realized by combining generation rules with deep learning. First, starting from the viewpoint sentence generation principle of news-comment social robots, keyword features, position features, semantic features and length features are extracted from the blog, the viewpoint sentence score of each sentence is calculated by weighted summation of the feature values, and the score is compared with a threshold. Sentences scoring above the threshold, which clearly contain viewpoint keywords, are recognized first by the rule-based method; the remaining sentences, scoring below the threshold, are then recognized with a TextCNN model. Experimental results show that the combined method exploits the respective advantages of the two methods: viewpoint sentences that follow the blog generation rules of social robots, and that may not read smoothly, can be accurately recognized through keywords, and the TextCNN model recognizes as many of the remaining viewpoint sentences as possible, which greatly improves the accuracy of social robot viewpoint sentence recognition.

(3) A strategy for classifying social robots by purpose is designed. Through text mining of social robot blogs, social robots are divided into three categories, namely content polluters, news reviewers and knowledge propagators, according to the different roles they play under a given topic. On the basis of this strategy, a social robot classification method based on transfer learning is also designed to address the scarcity of social robot data and the low detection accuracy. With transfer learning, the results of training on the blog data of a large number of ordinary accounts can be transferred to the social robot classification model, which improves the universality and interpretability of the classification method. The comparison experiments show that the social robot classification model is more accurate when trained with transfer learning.

(4) When the behavior features that an attacker sets for a social robot change, the method of the embodiment is still suitable for classifying the changed social robot. In addition, because the proposed classification strategy relies on text mining, the source data for transfer learning can be the blog data of normal users, which are easy to collect and plentiful; this solves the problem that normal learning cannot be carried out for social robots because similar sample accounts are difficult to obtain.

In addition to the specific implementation of the above embodiment, social robot classification may also be performed by using other models to judge blog-topic relevance and identify content polluters, and by using other models to judge viewpoint sentences and identify news reviewers and knowledge propagators.

The invention also provides a social robot classification system for implementing the method, and referring to fig. 8, the system includes:

A blog content obtaining module 801, configured to obtain the blog content of the target social robot about the target topic.

A classification model determination module 802, configured to determine the social robot classification model.

A classification identification module 803, configured to input the blog content of the target social robot about the target topic into a social robot classification model, so as to obtain a category of the target social robot; the categories include content polluters, knowledge propagators and news reviewers; the social robot classification model comprises a topic relevance target model and a viewpoint sentence identification target model.

A content polluter indicates that the blog content published by the target social robot is not related to the target topic; a news reviewer indicates that the blog content published by the target social robot is related to the target topic and issues opinions and expresses viewpoints; a knowledge propagator indicates that the blog content published by the target social robot is related to the target topic and propagates information and explains objective events.

The classification model determining module 802 specifically includes:

the source domain data set construction unit is used for constructing a source domain data set based on a transfer learning method; the source domain data set comprises a first type of data set and a second type of data set; the first type of data set comprises original blog content crawled on a microblog platform and issued by accounts under a set topic, original blog content issued by accounts under topics related to the set topic, and corresponding classification labels; a classification label indicates that the account belongs to the content polluters or that the account belongs to the knowledge propagators; the second type of data set comprises viewpoint-type blogs generated by a social robot sample data generation model and labeled as issued by accounts of news reviewers.

The target domain data set construction unit is used for determining a target domain data set based on the social robot recognition model; the target domain data set comprises the social robot real blog content with the labeled category.

The topic expansion and compression module is used for expanding and compressing the set topics in the source domain data set to obtain a topic expansion sequence.

A topic relevance target model determining module, configured to determine the topic relevance target model according to the source domain data set, the target domain data set, the topic expansion sequence, and the twin network; the topic relevance target model is used to identify content polluters.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A social robot classification method, comprising:

inputting the blog content of the target social robot about the target topic into a social robot classification model to obtain the category of the target social robot; the categories include content polluters, knowledge propagators and news reviewers; the social robot classification model comprises a topic relevance target model and a viewpoint sentence identification target model;

constructing a source domain data set based on a transfer learning method; the source domain data set comprises a first type of data set and a second type of data set; the first type of data set comprises original blog content crawled on a microblog platform and issued by an account under a set topic, original blog content issued by an account under a topic related to the set topic and a corresponding classification label; the classification label comprises that the account belongs to a content polluter or that the account belongs to a knowledge propagator; the second type of data set comprises opinion type blog articles generated by a social robot sample data generation model and marked to be issued by account numbers of news reviewers;

determining a target domain dataset based on a social robot recognition model; the target domain data set comprises the real blog content of the social robot with the labeled category;

expanding the set topics in the source domain data set and compressing topic contents to obtain a topic expansion sequence;

determining the topic relevance target model according to the source domain data set, the target domain data set, the topic expansion sequence and the twin network; the topic relevance target model is used for identifying content polluters;

2. The social robot classification method according to claim 1, wherein the determining the topic relevance target model according to the source domain data set, the target domain data set, the topic augmentation sequence and the twin network specifically comprises:

3. The social robot classification method according to claim 1, wherein the determining the opinion sentence recognition target model according to the source domain data set, the target domain data set, a rule-based opinion sentence recognition method and a text classification model specifically comprises:

4. The method for classifying social robots as claimed in claim 1, wherein the expanding and topic content compressing the set topics in the source domain data set to obtain a topic expansion sequence specifically comprises:

5. The method according to claim 1, wherein the determining a target domain data set based on a social robot recognition model specifically comprises:

determining the valid social robot data as a target domain data set.

6. The method of claim 2, wherein the twin network is a pre-trained Transformer-based bidirectional encoder.

7. The social robot classification method according to claim 1, wherein the step of inputting the blog content of the target social robot about the target topic into a social robot classification model to obtain the category of the target social robot comprises:

8. A social robot classification system for implementing the method of any one of claims 1-7, comprising:

the classification identification module is used for inputting the blog content of the target social robot about the target topic into a social robot classification model to obtain the category of the target social robot; the categories include content polluters, knowledge propagators and news reviewers; the social robot classification model comprises a topic relevance target model and a viewpoint sentence identification target model;

the classification model determining module specifically comprises:

the source domain data set construction unit is used for constructing a source domain data set based on a transfer learning method; the source domain data set comprises a first type of data set and a second type of data set; the first type of data set comprises original blog content crawled on a microblog platform and issued by an account under a set topic, original blog content issued by an account under a topic related to the set topic and a corresponding classification label; the classification label comprises that the account belongs to a content polluter or that the account belongs to a knowledge propagator; the second type of data set comprises opinion type blog articles generated by a social robot sample data generation model and marked to be issued by account numbers of news reviewers;

the target domain data set construction unit is used for determining a target domain data set based on the social robot recognition model; the target domain data set comprises the real blog content of the social robot with the labeled category;

a topic relevance target model determining module, configured to determine the topic relevance target model according to the source domain data set, the target domain data set, the topic expansion sequence, and the twin network; the topic relevance target model is used for identifying content polluters;