CN115952453A

CN115952453A - Identification method, device, equipment and storage medium of social media robot

Info

Publication number: CN115952453A
Application number: CN202211663812.5A
Authority: CN
Inventors: 李慧; 郭超; 韦崴; 李健鹏
Original assignee: China Electronics Industry Engineering Co ltd
Current assignee: China Electronics Industry Engineering Co ltd
Priority date: 2022-12-23
Filing date: 2022-12-23
Publication date: 2023-04-11

Abstract

The application relates to a training method for a social media robot identification model, and to an identification method, device, equipment and storage medium, belonging to the technical field of social media user identification. The method includes: acquiring user data to be trained, wherein the user data carries manually labeled category labels, and the category labels include: robot, suspected robot, and non-robot; performing feature extraction on the user data, and normalizing the extracted feature data, wherein the feature data at least includes: profile features, language features, emotional features and time sequence features; inputting the normalized feature data into a pre-constructed graph attention network model for training until the graph attention network model converges, to obtain a social media robot identification model; the social media robot identification model determines the category label of the current user. The method and device solve the prior-art problem that the identification accuracy for social media robots is low because emotional features in user data are ignored.

Description

Identification method, device, equipment and storage medium of social media robot

Technical Field

The application belongs to the technical field of social media security management, and particularly relates to a method, device, equipment and storage medium for identifying social media robots.

Background

With the rapid development of online social networks, social platforms such as Facebook, Twitter and Weibo have become important channels for acquiring, spreading and publishing information. A social media robot is an automated, algorithm-controlled abnormal social media user that can simulate the social behavior of normal humans on a social platform and interact with normal users; the large volume of views and opinions it produces can form strong public opinion and influence the judgment of the public. Identification techniques for social media robots have emerged accordingly. By identifying social media robots, the probability that a user is a robot, a suspected robot or a non-robot is determined, and users identified as robots or suspected robots then have their accounts cancelled or placed under control, which reduces the spread of malicious content and helps safeguard cyberspace.

At present, methods for identifying social media robots extract features of user accounts in social media, train an identification model on the extracted features, and use the model to identify users and detect social media robots. However, social media accounts have numerous features, and the features extracted by existing identification methods are mainly user-related features, content features and some social relationship features, while important emotional features are ignored, so the identification accuracy for social media robots is low. In addition, the feature values in the user data are unevenly distributed and span a wide range, so they cannot be normalized well, which affects the training speed and identification accuracy of the identification model.

Disclosure of Invention

In view of this, the application provides a method, device, equipment and storage medium for identifying social media robots, which solve the prior-art problem that the identification accuracy for social media robots is low because emotional features in user data are ignored.

In order to achieve the purpose, the following technical scheme is adopted in the application:

in a first aspect, the present application provides a method for training a social media robot recognition model, including:

acquiring user data to be trained, wherein the user data is provided with manually marked category labels;

the category label includes: a robot, a suspected robot, and a non-robot;

carrying out feature extraction on the user data, and carrying out normalization processing on the extracted feature data;

the feature data at least comprises: profile features, language features, emotional features and time sequence features;

inputting the normalized characteristic data into a pre-constructed graph attention network model for training until the graph attention network model is converged to obtain a social media robot recognition model;

the social media robot identification model is used for receiving the feature data after normalization processing, calculating the probability that the feature data after normalization processing belong to different class labels, and taking the class label corresponding to the maximum probability value as the class label of the current user.

Further, if the feature data includes a profile feature, performing feature extraction on the user data, including:

and extracting at least one of the user name length, the nickname length, the registration duration, whether a default profile is adopted, the number of friends, the number of fans, the number of followed accounts, the number of forwarding tweets, the number of mentioning tweets, the number of replies and the number of forwarded tweets.

Further, if the feature data includes a language feature, performing feature extraction on the user data, including:

extracting words of different parts of speech in the user data, including at least one of: verbs, nouns, adjectives, modal particles, prepositions, interjections, adverbs and pronouns;

for each part of speech, extracting the number and ratio of words of the part of speech in each tweet; and respectively calculating the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the part of speech in all the tweets according to the number and the ratio of the words of the part of speech in all the tweets.

Further, if the feature data includes an emotional feature, performing feature extraction on the user data, including:

extracting words and expressions of different emotion indexes in the user data, wherein the emotion indexes comprise at least one of happiness, valence, arousal degree, positive expression, negative expression and total expression;

for each emotion index, extracting to obtain the score of the emotion index in each tweet; and respectively calculating the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the emotion index in all the tweets according to the scores of the emotion index in all the tweets.

Further, if the feature data includes a time series feature, performing feature extraction on the user data, including:

extracting, from the user data, at least one of the times at which the user posts tweets, the times at which the user forwards tweets and the times at which the user mentions tweets;

respectively calculating the time interval between two successive posted tweets, between two successive forwarded tweets and between two successive mentioning tweets;

and respectively calculating, from the time intervals of all the tweets, the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the time intervals between posted tweets, between forwarded tweets and between mentioning tweets.
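As a minimal sketch of the interval calculation above, assuming the posting times are already available as `datetime` objects, the intervals could be computed as follows. The function name `intervals_seconds` and the sample timestamps are hypothetical, and measuring the intervals in seconds between successive tweets is an assumption; the same function would apply to forwarding and mentioning times.

```python
from datetime import datetime

def intervals_seconds(timestamps):
    """Return the gaps, in seconds, between successive timestamps."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]

# Hypothetical posting times of one user.
post_times = [
    datetime(2020, 1, 1, 12, 0, 0),
    datetime(2020, 1, 1, 12, 0, 30),
    datetime(2020, 1, 1, 12, 1, 0),
    datetime(2020, 1, 1, 13, 1, 0),
]
gaps = intervals_seconds(post_times)
print(gaps)
```

The resulting list of intervals is the input from which the eight distribution statistics are then calculated.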

Further, the normalizing the extracted feature data includes:

presetting corresponding segment values according to different value ranges of the extracted feature data;

and carrying out normalization processing on the extracted feature data according to the preset segmentation value.

Further, the inputting the normalized feature data into a pre-constructed graph attention network model for training includes:

constructing a graph attention network model with G layers, and initializing it;

determining an adjacent user corresponding to the user according to social information in user data;

the feature data after the normalization processing is used as input feature data of a first layer of the graph attention network model;

for each attention head, calculating the attention coefficient between the user and each adjacent user according to the input feature data of the user and the input feature data of the adjacent user;

calculating the output feature data of each attention head of the current layer for the user according to the input feature data of the adjacent users and the corresponding attention coefficients;

for any layer from the first layer to the (G-1)-th layer of the graph attention network model, concatenating the output feature data of all attention heads of that layer for the user to obtain the user's output feature data of that layer, which serves as the user's input feature data of the next layer;

for the G-th layer of the graph attention network model, averaging the output feature data of all attention heads of that layer for the user to obtain the prediction result of the user;

and updating the graph attention network model by using a cross entropy loss function according to the prediction results of all users until the graph attention network model converges to obtain a social media robot identification model.
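The forward pass described in the steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than the application's implementation: the scoring function (a LeakyReLU over a learned vector applied to the concatenated projections, as in the commonly used graph attention formulation), the self-connection, the final softmax, the random weights and all names are assumptions, and the activation functions between layers and the cross-entropy weight update are omitted.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(feats, nbrs, W, a, slope=0.2):
    """One attention head. feats: {user: vector}, nbrs: {user: [adjacent users]}."""
    # Linear projection of every user's input feature data.
    z = {u: [sum(w * x for w, x in zip(row, h)) for row in W]
         for u, h in feats.items()}
    out = {}
    for u in feats:
        group = [u] + list(nbrs[u])  # assumption: a user also attends to itself
        scores = []
        for v in group:
            e = sum(ai * x for ai, x in zip(a, z[u] + z[v]))
            scores.append(e if e > 0 else slope * e)  # LeakyReLU (assumed)
        alpha = softmax(scores)  # attention coefficients over the neighbourhood
        # Weighted sum of the neighbours' projected features.
        out[u] = [sum(al * z[v][k] for al, v in zip(alpha, group))
                  for k in range(len(W))]
    return out

def gat_layer(feats, nbrs, heads, last):
    """Concatenate head outputs for layers 1..G-1; average them for layer G."""
    per_head = [attention_head(feats, nbrs, W, a) for W, a in heads]
    result = {}
    for u in feats:
        vecs = [head_out[u] for head_out in per_head]
        if last:
            averaged = [sum(col) / len(vecs) for col in zip(*vecs)]
            result[u] = softmax(averaged)  # assumption: softmax gives class probabilities
        else:
            result[u] = [x for vec in vecs for x in vec]
    return result

def random_heads(rng, n_heads, in_dim, out_dim):
    """Hypothetical initialisation: one (W, a) pair per attention head."""
    return [([[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)],
             [rng.uniform(-1, 1) for _ in range(2 * out_dim)])
            for _ in range(n_heads)]

rng = random.Random(0)
# Hypothetical normalized feature data and adjacency for three users.
feats = {"u1": [0.1, 0.9, 0.3, 0.5], "u2": [0.8, 0.2, 0.4, 0.1], "u3": [0.5, 0.5, 0.7, 0.9]}
nbrs = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"]}

hidden = gat_layer(feats, nbrs, random_heads(rng, 2, 4, 5), last=False)
probs = gat_layer(hidden, nbrs, random_heads(rng, 2, 10, 3), last=True)
```

Here G = 2: the first layer concatenates two heads of width 5 into a 10-dimensional vector per user, and the second layer averages two heads of width 3, one value per category label.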

In a second aspect, the present application provides a social media robot recognition method, including:

acquiring user data of a social media to be identified;

inputting the feature data after normalization processing into a social media robot recognition model obtained by training with any one of the above methods to obtain a category label of the user data;

the category label includes: robots, suspected robots, and non-robots.

In a third aspect, the present application provides an identification apparatus for a social media robot, the apparatus comprising:

the data acquisition module is used for acquiring user data to be trained, and the user data is provided with manually labeled class labels; the category label includes: robots, suspected robots, and non-robots;

the feature extraction module is used for performing feature extraction on the user data and normalizing the extracted feature data; the feature data at least comprises: profile features, language features, emotional features and time sequence features;

the model training module is used for inputting the normalized feature data into a pre-constructed graph attention network model for training until the graph attention network model converges, to obtain a social media robot identification model.

In a fourth aspect, the present application provides an identification device for a social media robot, comprising:

a memory having an executable program stored thereon;

a processor for executing the executable program in the memory to implement the steps of any of the above methods.

In a fifth aspect, the present application provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the steps of any of the methods described above.

By adopting the above technical scheme, the application has at least the following beneficial effects:

by additionally extracting the emotional features and the time sequence features in the user data, the feature data used for model training covers more dimensions, and the category label of the current user can be judged accurately on the basis of the trained model, which reduces the false identification and missed identification caused by ignoring the emotional features in the user data and improves the user experience.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram illustrating a method of training a social media robot recognition model in accordance with an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a method of feature extraction for the user data in accordance with an exemplary embodiment;

FIG. 3 is a flow diagram illustrating yet another method of feature extraction for the user data in accordance with an illustrative embodiment;

FIG. 4 is a flow diagram illustrating yet another method of feature extraction for the user data in accordance with an illustrative embodiment;

FIG. 5 is a flow diagram illustrating training in a graph attention network model in accordance with an exemplary embodiment;

FIG. 6 is a block diagram illustrating an identification apparatus of a social media bot in accordance with an exemplary embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail below. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for training a social media robot recognition model according to an exemplary embodiment. The method comprises the following steps:

S11, obtaining user data to be trained, wherein the user data is provided with manually labeled category labels; the category labels include: robot, suspected robot, and non-robot.

Specifically, the user data may be obtained by a crawler or from a public database, for example Twitter data from a recent period, such as information on all users between 2000 and 2020, including but not limited to each user's registration time, nickname, avatar, friends, followed users and followers, and all of the user's tweets.

Each account is operated by only one user, so one account corresponds to one user and is labeled with one category label. An account of a social media robot is labeled as a robot; an account that cannot be determined to belong to either a normal user or a social media robot is labeled as a suspected robot; an account of a normal user is labeled as a non-robot.

S12, extracting the characteristics of the user data, and normalizing the extracted characteristic data; the characteristic data at least comprises: profile features, linguistic features, emotional features, and timing features.

Specifically, multi-dimensional feature extraction is performed on the acquired user data to obtain feature data of the user containing information in multiple dimensions. The extracted feature data includes at least one of profile features, language features, emotional features and time sequence features. The profile features represent the basic state of a user's account during registration and use; the language features represent features of the words in the user's tweets; the emotional features represent emotion-related features expressed by the user in tweets; the time sequence features represent time-related features of the user's social activity on the account.

Because multi-dimensional features are extracted, the user's feature data covers more dimensions and represents richer information about the user. The extraction of emotional features in particular benefits the subsequent training of the social media robot recognition model: the trained model has features in more dimensions, which improves the accuracy of identifying social media robots among users.

In the extracted feature data, the feature values of some features span a large range. For example, for the number of fans, some users (such as internet celebrities) may have 10 million fans, while others (such as ordinary users) have only 20, so the feature value ranges from 0 to tens of millions. After normalization, the feature data is limited to a certain range (such as [0, 1] or [-1, 1]), which accelerates the convergence of the social media robot recognition model.
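One way to keep such a wide-ranging feature within [0, 1] is a piecewise-linear mapping over preset segment values. The sketch below is only an assumption about how segment-based normalization could look; the breakpoints in `FAN_SEGMENTS` and the function name `piecewise_normalize` are hypothetical and are not taken from the application.

```python
from bisect import bisect_right

def piecewise_normalize(value, segments):
    """Map value into [0, 1], giving each segment an equal share of the range."""
    if value <= segments[0]:
        return 0.0
    if value >= segments[-1]:
        return 1.0
    k = bisect_right(segments, value) - 1      # index of the segment containing value
    low, high = segments[k], segments[k + 1]
    share = 1.0 / (len(segments) - 1)          # width of each segment in the output range
    return (k + (value - low) / (high - low)) * share

# Hypothetical segment values for the number of fans.
FAN_SEGMENTS = [0, 100, 10_000, 1_000_000, 10_000_000]

ordinary_user = piecewise_normalize(20, FAN_SEGMENTS)
celebrity = piecewise_normalize(10_000_000, FAN_SEGMENTS)
```

With equal output shares per segment, an ordinary user with 20 fans and a celebrity with 10 million fans both land inside [0, 1] without the small values being squeezed toward zero, as they would be under plain min-max scaling.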

S13, inputting the characteristic data after the normalization processing into a pre-constructed graph attention network model for training until the graph attention network model is converged to obtain a social media robot recognition model; the social media robot identification model is used for receiving the feature data after normalization processing, calculating the probability that the feature data after normalization processing belong to different class labels, and taking the class label corresponding to the maximum probability value as the class label of the current user.

Specifically, a graph attention network model is constructed and initialized by combining a graph neural network with an attention mechanism. The graph neural network can aggregate a large amount of information in graph-structured data, and the attention mechanism, by assigning different weights, can focus on the useful information of adjacent nodes. The graph attention network model can therefore make full use of a large amount of feature data while extracting the useful information from it, which improves the training speed of the graph attention network model.

By additionally extracting the emotional features and the time sequence features in the user data, the feature data used for model training covers more dimensions, and the category label of the current user can be judged accurately on the basis of the trained social media robot recognition model, which reduces the false recognition and missed recognition caused by ignoring the emotional features in the user data and improves the user experience.

The trained social media robot recognition model can receive the feature data after normalization processing, calculate the probability that the feature data belong to 3 different types of labels, namely, a robot, a suspected robot and a non-robot, and take the type label corresponding to the maximum probability value as the type label of the current user.

For example, if the probabilities that the normalized feature data belongs to the 3 category labels, namely robot, suspected robot and non-robot, are 0.7, 0.1 and 0.2 respectively, the category label of the user is determined to be robot.
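The selection of the label with the maximum probability can be sketched as follows. The sketch assumes that the model produces one raw score per category label and that a softmax turns these scores into probabilities; the function name `predict_label` and the sample scores are hypothetical.

```python
import math

LABELS = ["robot", "suspected robot", "non-robot"]

def predict_label(scores):
    """Convert raw scores to probabilities and pick the most probable label."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

# Hypothetical raw scores for one user.
label, probs = predict_label([2.0, 0.1, 0.9])
```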

It should be noted that the technical solution provided in this embodiment is applicable to identifying social media robots. For example, for users of social platforms such as Facebook, Twitter and Weibo, the probability that the current user is a robot, a suspected robot or a non-robot is determined, and the category label with the maximum probability is taken as the category label of the current user. This provides a basis for cancelling or controlling the accounts of users labeled as robots or suspected robots, which reduces the spread of malicious content and helps safeguard cyberspace. In practice, the technical solution provided by this embodiment needs to be loaded onto an electronic device or a controller to run.

It can be understood that, in the technical scheme provided by this embodiment, by additionally extracting the emotional features and the time sequence features in the user data, the feature data used for model training covers more dimensions, and the social media robot recognition model trained on this basis can determine the category label of the current user more accurately, which reduces the false recognition and missed recognition caused by ignoring the emotional features in the user data and improves the user experience.

In an embodiment, in step S12, if the feature data includes a profile feature, performing feature extraction on the user data includes: extracting at least one of the user name length, the nickname length, the registration duration, whether a default profile is adopted, the number of friends, the number of fans, the number of followed accounts, the number of forwarding tweets, the number of mentioning tweets, the number of replies and the number of forwarded tweets.

Specifically, "user name length" refers to the number of characters in the user name with which the social media account logs in to the system;

"nickname length" refers to the number of characters in the nickname used by the social media account;

"registration duration" refers to the number of years from the registration of the social media account to the time the feature data is extracted;

"whether a default profile is adopted" refers to whether the signature and the background of the social media account use the system defaults. If either the signature or the background uses the system default, the social media account is considered to adopt the default profile; otherwise it does not. For a social media account that adopts the default profile, the feature value is set to 1; for a social media account that does not, the feature value is set to 0.

"number of friends" refers to the number of friends the social media account has;

"number of fans" refers to the number of fans the social media account has;

"number of followed accounts" refers to the total number of accounts the social media account follows;

"number of tweets", "number of forwarding tweets", "number of mentioning tweets", "number of replies" and "number of forwarded tweets" refer, respectively, to the number of tweets the social media account has posted, the number it has forwarded, the number in which it mentions other users, the number of replies it has made, and the number of its tweets that have been forwarded.

As many profile features as possible should be extracted from the user data; they may include one or more of the above features, which is not specifically limited in this application.

It can be understood that, in the technical scheme provided by this embodiment, by extracting the profile features in the user data, the feature data used for model training covers the user's registration information and the state and frequency of account use. The social media robot recognition model trained on this basis has multi-dimensional basic user features, which strongly supports accurate judgment of the category label of the current user.
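A few of the profile features defined above can be sketched as follows. The dictionary keys (`user_name`, `default_signature` and so on), the function name `profile_features` and the sample account are hypothetical, and expressing the registration duration as fractional years is an assumption.

```python
from datetime import date

def profile_features(account, today):
    """Derive a subset of the profile features from one account record."""
    # The default profile is adopted if either the signature or the background is the default.
    uses_default = account["default_signature"] or account["default_background"]
    return {
        "user_name_length": len(account["user_name"]),
        "nickname_length": len(account["nickname"]),
        "registration_years": (today - account["registered_on"]).days / 365.25,
        "default_profile": 1 if uses_default else 0,
        "friends": account["friends"],
        "fans": account["fans"],
    }

# Hypothetical account record.
account = {
    "user_name": "news_bot_42",
    "nickname": "Daily News",
    "registered_on": date(2020, 1, 1),
    "default_signature": True,
    "default_background": False,
    "friends": 3,
    "fans": 20,
}
features = profile_features(account, today=date(2023, 1, 1))
```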

In an embodiment, referring to fig. 2, which is a flowchart of a method for feature extraction on the user data according to an exemplary embodiment, in step S12, if the feature data includes a language feature, performing feature extraction on the user data includes:

Step S201, extracting words of different parts of speech in the user data, including at least one of: verbs, nouns, adjectives, modal particles, prepositions, interjections, adverbs and pronouns.

Specifically, feature extraction is performed on the content of all tweets in the user data. The tweets contain at least one of 8 types of words, namely verbs, nouns, adjectives, modal particles, prepositions, interjections, adverbs and pronouns, and these 8 types of words are extracted. For example, the verb 'bounce' is extracted.

Step S202, for each part of speech, extracting the number and the ratio of the words of the part of speech in each tweet.

Specifically, for the words of the 8 parts of speech, the number of words of each part of speech in each tweet is extracted, and the ratio of the words of each part of speech to all the words of that tweet is calculated.

For example, if a tweet sent by user A contains 2 verbs and 10 words of the 8 parts of speech in total, then in that tweet the number of verbs is 2 and their ratio is 20%.
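The per-tweet count and ratio can be sketched as follows, assuming the words of a tweet have already been tagged with their part of speech by some tagger. The tag names, the function name `pos_counts_and_ratios` and the sample tweet are hypothetical.

```python
from collections import Counter

PARTS_OF_SPEECH = ["verb", "noun", "adjective", "modal particle",
                   "preposition", "interjection", "adverb", "pronoun"]

def pos_counts_and_ratios(tagged_words):
    """Return {part of speech: (count, ratio)} for one tweet."""
    counts = Counter(tag for _, tag in tagged_words if tag in PARTS_OF_SPEECH)
    total = sum(counts.values())
    return {tag: (counts[tag], counts[tag] / total if total else 0.0)
            for tag in PARTS_OF_SPEECH}

# Hypothetical tagged tweet with 10 words, 2 of which are verbs.
tweet = [("they", "pronoun"), ("bounce", "verb"), ("and", "other"),
         ("jump", "verb"), ("on", "preposition"), ("the", "other"),
         ("big", "adjective"), ("red", "adjective"), ("ball", "noun"),
         ("very", "adverb"), ("happily", "adverb"), ("wow", "interjection")]
result = pos_counts_and_ratios(tweet)
```

In this sample, 10 of the words belong to the 8 parts of speech and 2 of them are verbs, so `result["verb"]` holds a count of 2 and a ratio of 0.2, matching the example above.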

Step S203, respectively calculating the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the number of the words of the part of speech and the ratio of the words of the part of speech in all the tweets.

Specifically, for each part of speech, the number of words of that part of speech in each tweet is determined; the smallest of these numbers is taken as the minimum value over all tweets, and the largest as the maximum value over all tweets.

The numbers of words of the part of speech in the individual tweets are sorted, and the middle value is taken as the median of the number of words of the part of speech.

The numbers of words of the part of speech in all the tweets are averaged to obtain the average value of the number of words of the part of speech.

The standard deviation of the numbers of words of the part of speech in all the tweets is calculated to obtain the standard deviation of the number of words of the part of speech.

The skewness S of the number of words of the part of speech in all the tweets is calculated according to the following formula:

$$S = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^{3}$$

where $X_i$ is the number of words of the part of speech in the i-th tweet, $n$ is the number of tweets, $\mu$ is the average value of the number of words of the part of speech, and $\sigma$ is the standard deviation of the number of words of the part of speech.

The skewness S obtained in this way measures the asymmetry of the number of words of the part of speech across the tweets.

The kurtosis K of the number of words of the part of speech in all the tweets is calculated according to the following formula:

$$K = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^{4}$$

where, as above, $X_i$ is the number of words of the part of speech in the i-th tweet, $n$ is the number of tweets, $\mu$ is the average value of the number of words of the part of speech, and $\sigma$ is the standard deviation of the number of words of the part of speech.

The kurtosis K obtained in this way measures how steeply peaked the distribution of the number of words of the part of speech is across the tweets.

The entropy H of the number of words of the part of speech in all the tweets is calculated according to the following formula:

$$H = -\sum_{i=1}^{n} p(X_i) \log p(X_i)$$

where $X_i$ is the number of words of the part of speech in the i-th tweet, $n$ is the number of tweets, and $p(X_i)$ is the ratio of the number of words of the part of speech in the i-th tweet to the number of words of the part of speech in all tweets.

The entropy H obtained in this way measures the uncertainty of the number of words of the part of speech across the tweets.

In the same way, the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the ratio of the words of the part of speech in all tweets are calculated.

Thus, for each part of speech, the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of both the number and the ratio of words of that part of speech in all tweets are obtained, giving 16 features. Applying the same method to all 8 parts of speech gives 128 features in total.
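The eight statistics can be sketched with the Python standard library as follows. The sketch assumes the usual standardized-moment definitions of skewness and kurtosis (with the population standard deviation), a Shannon entropy over each tweet's share of the total, non-negative input values and a non-empty list; the function name `distribution_features` is hypothetical.

```python
import math
import statistics

def distribution_features(values):
    """Return the 8 distribution statistics of a non-empty list of non-negative values."""
    n = len(values)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)          # population standard deviation
    if sigma > 0:
        skew = sum(((x - mu) / sigma) ** 3 for x in values) / n
        kurt = sum(((x - mu) / sigma) ** 4 for x in values) / n
    else:
        skew = kurt = 0.0                      # assumption: constant data has no shape
    total = sum(values)
    # Share of each tweet in the total; zero shares contribute nothing to the entropy.
    shares = [x / total for x in values if x > 0] if total > 0 else []
    entropy = -sum(p * math.log(p) for p in shares)
    return {
        "min": min(values), "max": max(values),
        "median": statistics.median(values), "mean": mu, "std": sigma,
        "skewness": skew, "kurtosis": kurt, "entropy": entropy,
    }

# Hypothetical verb counts in five tweets of one user.
verb_counts = [1, 2, 3, 4, 5]
stats = distribution_features(verb_counts)
```

Calling the function once on the counts and once on the ratios of a part of speech yields the 16 features per part of speech described above.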

It can be understood that, in the technical scheme provided by this embodiment, calculating multiple feature values of the number and ratio of words of multiple parts of speech enriches the data expressed in the feature data and is more comprehensive than simply selecting one of these features, so that more user information is available when the social media robot recognition model is subsequently trained, which improves the recognition accuracy of the model.

In an embodiment, as shown in fig. 3, which is a flowchart of another method for extracting features of the user data according to an exemplary embodiment, in step S12, if the feature data includes an emotional feature, the extracting features of the user data includes:

Step S301, extracting words and expressions of different emotion indexes in the user data, wherein the emotion indexes comprise at least one of happiness, valence, arousal degree, control degree, positive expression, negative expression and total expression.

Specifically, for each tweet sent by the user, the words and expressions related to happiness, valence, arousal degree, control degree, positive expression, negative expression and total expression are extracted.

Step S302, extracting and obtaining the score of the emotion index in each tweet for each emotion index.

Specifically, for each of the emotion indexes of happiness, valence, arousal degree, control degree, positive expression, negative expression and total expression, a score is calculated in each tweet: a happiness score representing the user's degree of satisfaction, a valence score representing the user's emotional state, an arousal score representing the degree to which the user's emotion is aroused, and a control score representing the degree to which the user controls the emotion. The number of positive expressions in the tweet is used as the positive expression score, the number of negative expressions as the negative expression score, and the total number of expressions as the total expression score.

The happiness score is calculated from public data based on the research of Kloumann at the University of Vermont in the United States, and the valence, arousal degree and control degree scores are calculated from public data based on the research of Amy Beth Warriner at McMaster University.
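The per-tweet scoring can be sketched as follows. The application does not state how word-level scores are combined into a tweet-level score, so averaging the scores of the words found in a lexicon is an assumption; the lexicon values, the expression sets and the function names are hypothetical and are not taken from the cited public data.

```python
def emotion_score(words, lexicon):
    """Average the lexicon scores of the words that appear in the lexicon."""
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else None  # None: no scored word in the tweet

def expression_counts(expressions, positive, negative):
    """Return (positive count, negative count, total count) for one tweet."""
    pos = sum(1 for e in expressions if e in positive)
    neg = sum(1 for e in expressions if e in negative)
    return pos, neg, len(expressions)

# Hypothetical happiness lexicon and expression sets.
happiness_lexicon = {"happy": 8.0, "sad": 2.0}
positive_set = {":)", ":D"}
negative_set = {":("}

score = emotion_score(["i", "am", "happy", "not", "sad"], happiness_lexicon)
counts = expression_counts([":)", ":D", ":("], positive_set, negative_set)
```

The same `emotion_score` function would be called with a separate lexicon for each of valence, arousal degree and control degree.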

Step S303, respectively calculating the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the emotion index in all the tweets according to the scores of the emotion index in all the tweets.

For the scores of each emotion index among happiness, valence, arousal degree, control degree, positive expression, negative expression and total expression, the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the emotion index over all tweets are calculated with the same method and formulas as in step S203. Each emotion index yields 8 feature values, so the 7 emotion indexes yield 56 feature values in total.

It can be understood that, in the technical scheme provided by this embodiment, the words and expressions of different emotion indexes are extracted from the tweets in the user data to obtain scores for multiple emotion indexes, and these scores are summarized with several distribution statistics to obtain multiple feature values. The emotional features in the user data are thereby fully utilized as training data for the social media robot recognition model, and the dimensions covered by the feature data are more comprehensive.
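Steps S301 to S303 can be illustrated with a minimal Python sketch. It assumes the per-tweet scores have already been computed and only shows how 7 emotion indexes times 8 statistics give 56 feature values. The names `EMOTION_INDEXES`, `summarize` and `emotion_features` are hypothetical, and because the formulas of step S203 are not reproduced in this section, the population standard deviation, the moment-based skewness and kurtosis, and the Shannon entropy of the empirical score distribution are all assumptions.

```python
import math
from collections import Counter
from statistics import mean, median, pstdev

# Hypothetical key names for the 7 emotion indexes.
EMOTION_INDEXES = ["happiness", "valence", "arousal", "control",
                   "positive_expression", "negative_expression",
                   "total_expression"]

def summarize(values):
    """Return 8 statistics for one list of per-tweet scores."""
    n = len(values)
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd > 0:
        skew = sum((v - mu) ** 3 for v in values) / (n * sd ** 3)
        kurt = sum((v - mu) ** 4 for v in values) / (n * sd ** 4)
    else:
        skew = kurt = 0.0  # assumption: constant scores get zero shape statistics
    # Assumption: Shannon entropy (bits) of the empirical distribution of scores.
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return [min(values), max(values), median(values), mu, sd, skew, kurt, entropy]

def emotion_features(scores_per_tweet):
    """scores_per_tweet: one dict per tweet, mapping index name -> score."""
    features = []
    for name in EMOTION_INDEXES:
        features.extend(summarize([tweet[name] for tweet in scores_per_tweet]))
    return features
```

Calling `emotion_features` on a user's tweets returns a flat list of 56 values, ordered index by index.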

In an embodiment, as shown in fig. 4, which is a flowchart of another method for feature extraction on the user data according to an exemplary embodiment, in step S12, if the feature data includes a time-series feature, the feature extraction on the user data includes:

Step S401, extracting from the user data at least one of the time when the user sends a tweet, the time when the user forwards a tweet and the time when the user mentions a tweet.

Specifically, the user data covers tweets that are sent, forwarded, mentioned, collected and so on. A tweet sent directly by the user is regarded as a sent tweet, a tweet forwarded by the user is regarded as a forwarded tweet, and tweets produced in any other way are regarded as mentioned tweets. At least one of the time when the user sends a tweet, the time when the user forwards a tweet and the time when the user mentions a tweet is then extracted.

Step S402, respectively calculating the time interval between two sent tweets, between two forwarded tweets and between two mentioned tweets.

Specifically, the time interval between two tweets may be expressed in years, months, hours, minutes and the like, which is not specifically limited in this application.

Step S403, respectively calculating, according to the time intervals of all the tweets, the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the time intervals between two sent tweets, between two forwarded tweets and between two mentioned tweets.

Using the same method and formulas as in step S203, the minimum value, the maximum value, the median, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the tweet time intervals of each category are calculated, giving 8 features per category. The 3 categories of tweet time interval (sending, forwarding and mentioning) therefore yield 24 feature values in total.

It can be understood that, in the technical scheme provided by this embodiment, the time intervals of the different types of tweets are calculated and summarized with several distribution statistics to obtain multiple feature values representing the time sequence. The time sequence is thereby better represented in the feature data, so that the social media robot recognition model has more information about the timing behavior of the user when it is subsequently trained, which improves its recognition accuracy.
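Steps S401 to S403 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the formulas of step S203: intervals are taken between consecutive tweets of the same category and measured in minutes, a category with fewer than two tweets contributes zeros, and the statistics use the population standard deviation and a Shannon entropy over the observed interval values. The names `summarize`, `intervals` and `timing_features` are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime
from statistics import mean, median, pstdev

def summarize(values):
    """Return 8 statistics for one list of time intervals."""
    n, mu, sd = len(values), mean(values), pstdev(values)
    skew = sum((v - mu) ** 3 for v in values) / (n * sd ** 3) if sd > 0 else 0.0
    kurt = sum((v - mu) ** 4 for v in values) / (n * sd ** 4) if sd > 0 else 0.0
    # Assumption: Shannon entropy (bits) of the empirical distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
    return [min(values), max(values), median(values), mu, sd, skew, kurt, entropy]

def intervals(timestamps, unit_seconds=60.0):
    """Gaps between consecutive timestamps, in minutes by default."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / unit_seconds
            for a, b in zip(ordered, ordered[1:])]

def timing_features(events):
    """events: list of (kind, datetime), kind in {"send", "forward", "mention"}."""
    features = []
    for kind in ("send", "forward", "mention"):
        gaps = intervals([t for k, t in events if k == kind])
        # Assumption: too few tweets in a category yields 8 zeros.
        features.extend(summarize(gaps) if gaps else [0.0] * 8)
    return features
```

The result is a flat list of 24 values: 8 statistics for each of the three interval categories.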

In an embodiment, in step S12, normalizing the extracted feature data includes:

presetting corresponding segment values according to different value ranges of the extracted feature data; and carrying out normalization processing on the extracted feature data according to the preset segmentation value.

Specifically, among the profile features of the user, the user name length is subject to a character limit, generally no more than 50, so the value range of this feature data is 0-50; the number of fans, by contrast, can reach the thousands or the millions, so the value range of this feature data is 0-5 million. The value ranges of these two items of feature data differ and span very different magnitudes. Presetting different segment values for different feature data gives better normalization, which improves the convergence speed and accelerates training when the social media robot recognition model is subsequently trained. For example, the segment value is set to 10 for the user name length and to 10,000 for the number of fans.

It should be noted that, according to different value ranges of the feature data, the preset segment value is obtained through verification by an experiment, and the specific experiment method is not specifically limited in the present disclosure.

The extracted feature data is normalized using a method that combines a log function and an atan function. For each item of feature data, the corresponding normalized feature data is calculated according to the following formula:

where v is a feature value of the feature data, v' is the corresponding feature value of the normalized feature data, and n is the preset segment value, which takes different values according to the value range of the feature values of each item of feature data.

It can be understood that, in the technical scheme provided by this embodiment, by adopting a method of combining a log function and an atan function, setting different segment values, and mapping the feature values of the feature data, the feature values can be normalized to a desired range, and the problem that the feature values have a large numerical span and are difficult to be well normalized is solved. The social media robot recognition model is trained by using the feature data after the normalization processing, so that the gradient descent speed can be increased while enough training sample data are ensured, the model training time is shortened, the reliability of the training result is ensured, the feature values of the feature data with different dimensions are in a comparable range, and the accuracy of the recognition result is greatly improved.
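The exact normalization formula is not reproduced in this text, so the following Python sketch only illustrates the general idea of combining a log function and an atan function with a per-feature segment value n. The specific form v' = (2/pi) * atan(log(1 + v) / log(1 + n)) is an assumption chosen for illustration, not the formula of the application; under this assumed form a feature value equal to its segment value maps to 0.5 and every non-negative value maps into [0, 1). The function name `normalize` is hypothetical.

```python
import math

def normalize(v, n):
    """Map a non-negative feature value v into [0, 1) using segment value n.

    Assumed illustrative form (not taken from the source):
        v' = (2 / pi) * atan(log(1 + v) / log(1 + n))
    """
    if v < 0 or n <= 0:
        raise ValueError("expects v >= 0 and n > 0")
    return (2.0 / math.pi) * math.atan(math.log1p(v) / math.log1p(n))

# Segment values from the example in the text: 10 for user name length,
# 10,000 for the number of fans.
print(normalize(12, 10))
print(normalize(250_000, 10_000))
```

With this kind of mapping, a user name length and a fan count land in the same range even though their raw values differ by several orders of magnitude.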

In an embodiment, as shown in fig. 5, which is a flowchart illustrating training of a graph attention network model according to an exemplary embodiment, in step S13, inputting the feature data after the normalization processing into a pre-constructed graph attention network model for training includes:

Step S501, constructing a G-layer graph attention network model and initializing it.

Specifically, a graph attention network model is constructed by using a combination of a graph neural network and a multi-head attention layer, wherein the number of layers of the graph attention network model is set as G, and the value of G is not specifically limited in the present disclosure. The graph neural network can well process data with large amount of information, the multi-head attention layer is helpful to extract only useful information from the large amount of data, and the graph attention network model combining the graph neural network and the multi-head attention layer can well process the large amount of data and extract useful information from the large amount of data.

The initialization sets the number of attention heads in the graph attention network model, together with the weight matrix and weight vector of each attention head in each layer of the network model.

Step S502, according to social information in the user data, determining an adjacent user corresponding to the user.

Specifically, in order to make full use of the social relationships in the user data, two users connected by a following or followed-by relationship are treated as adjacent users. The graph attention network model attends only to the feature data of adjacent users and assigns different weights to that feature data, so that more important user data receives higher weights during model training, which improves the recognition accuracy of the trained model.
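Step S502 can be sketched as building an adjacency map from follow relationships. This minimal Python example assumes the social information is available as (follower, followed) pairs and treats both directions as adjacency, as described above; the function name `build_neighbors` and the choice to ignore self-follows are assumptions.

```python
from collections import defaultdict

def build_neighbors(follow_edges):
    """follow_edges: iterable of (follower_id, followed_id) pairs.

    Returns a dict mapping each user to the set of adjacent users.
    Following and being followed both count as adjacency.
    """
    neighbors = defaultdict(set)
    for follower, followed in follow_edges:
        if follower == followed:
            continue  # assumption: ignore self-follows
        neighbors[follower].add(followed)
        neighbors[followed].add(follower)
    return dict(neighbors)
```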

Step S503, taking the feature data after the normalization processing as input feature data of the first layer of the graph attention network model.

Specifically, for each user, all feature values in the feature data obtained after the normalization process form a feature vector corresponding to the user, and the feature vector is used as input feature data and is input into the first layer of the graph attention network model.

Step S504, calculating the attention coefficient of each attention head corresponding to the user and the adjacent user according to the input characteristic data of the user and the input characteristic data of the adjacent user.

Specifically, the attention coefficient $\alpha_{ij}$ between user i and adjacent user j in each attention head of layer g is calculated according to the following formula:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a_g^{T}\left[W_g h_i^{g} \,\Vert\, W_g h_j^{g}\right]\right)\right)}{\sum_{m \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a_g^{T}\left[W_g h_i^{g} \,\Vert\, W_g h_m^{g}\right]\right)\right)}$$

where $h_i^{g}$ is the input feature data of user i at layer g of the network model; $h_j^{g}$ is the input feature data of user j at layer g, user i and user j being adjacent users; $W_g$ is the weight matrix in the attention head at layer g; $a_g$ is the weight vector in the attention head at layer g; $N_i$ is the set of users adjacent to user i; $h_m^{g}$ is the input feature data of user m; and $\Vert$ denotes concatenation.

According to this formula, the attention coefficients between every user and its adjacent users are obtained for all attention heads in layer g.

Step S505, calculating the output feature data of each attention head of the layer for the user according to the input feature data of the adjacent users and the corresponding attention coefficients.

Specifically, the output feature data $h_i^{g,k}$ of user i in attention head k of layer g is calculated according to the following formula:

$$h_i^{g,k} = \sum_{j \in N_i} \alpha_{ij}^{k} W_g^{k} h_j^{g}$$

where $h_j^{g}$ is the input feature data of user j at layer g of the network model, $W_g^{k}$ is the weight matrix in attention head k at layer g, and $\alpha_{ij}^{k}$ is the attention coefficient $\alpha_{ij}$ between user i and user j in attention head k.

The output feature data of every user in all attention heads of layer g is obtained according to this formula.

Step S506, for any layer from the first layer to the G-1 layer of the graph attention network model, splicing the output characteristic data of all the attention heads of the layer of the user to obtain the output characteristic data of the layer of the user; the output characteristic data of the layer is the input characteristic data of the next layer of the user.

For any layer g from the first layer to layer G-1 of the graph attention network model, the output feature data of all attention heads of each user is spliced (concatenated) according to the following formula to obtain the output feature data of user i at layer g:

$$h_i^{g+1} = \Big\Vert_{k=1}^{K} h_i^{g,k}$$

where $h_i^{g,k}$ is the output feature data of user i in attention head k at layer g of the network model, K is the number of attention heads, and $\Vert$ denotes concatenation.

The output feature data of all users at layer g is obtained by splicing according to this formula, and the output feature data of a user at layer g is used as the input feature data $h_i^{g+1}$ of that user at layer g+1.

Step S507, for layer G of the graph attention network model, averaging the output feature data of all attention heads of the layer for the user to obtain the prediction result of the user.

Specifically, in the last layer G of the graph attention network model, the average of the output feature data of all attention heads is calculated for user i according to the following formula to obtain the prediction result $\hat{y}_i$ of user i:

$$\hat{y}_i = \frac{1}{K} \sum_{k=1}^{K} h_i^{G,k}$$

where, as above, $h_i^{G,k}$ is the output feature data of user i in attention head k at layer G of the network model, and K is the number of attention heads.

The prediction results of all users are obtained according to this formula.
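The forward pass of steps S504 to S507 can be sketched in plain Python. This is a minimal illustration that follows the standard graph attention formulation, which is an assumption about details this text does not spell out: the LeakyReLU with slope 0.2 inside the attention score, the softmax over adjacent users, and the absence of any further activation on the head output. It also assumes every user has at least one adjacent user. The names `matvec`, `leaky_relu`, `head_output` and `gat_layer` are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def leaky_relu(z, slope=0.2):  # slope is an assumed value
    return z if z > 0 else slope * z

def head_output(h, neighbors, W, a):
    """One attention head: output feature vector of every user."""
    Wh = {u: matvec(W, x) for u, x in h.items()}
    out = {}
    for i in h:
        adj = neighbors[i]  # assumption: non-empty for every user
        # Score for each adjacent user j: LeakyReLU(a . [W h_i || W h_j])
        scores = {j: leaky_relu(sum(ak * v for ak, v in zip(a, Wh[i] + Wh[j])))
                  for j in adj}
        top = max(scores.values())
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        alpha = {j: e / total for j, e in exps.items()}  # attention coefficients
        out[i] = [sum(alpha[j] * Wh[j][d] for j in adj)
                  for d in range(len(Wh[i]))]
    return out

def gat_layer(h, neighbors, heads, last):
    """heads: list of (W, a). Layers 1..G-1 concatenate heads, layer G averages."""
    outs = [head_output(h, neighbors, W, a) for W, a in heads]
    result = {}
    for i in h:
        if last:
            dim = len(outs[0][i])
            result[i] = [sum(o[i][d] for o in outs) / len(outs)
                         for d in range(dim)]
        else:
            result[i] = [x for o in outs for x in o[i]]
    return result
```

Stacking G calls to `gat_layer`, with `last=True` only for the final call, mirrors the layer structure described above. Turning the final averaged values into per-label probabilities would need an extra step such as a softmax, which this sketch leaves out.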

Step S508, updating the graph attention network model by using a cross entropy loss function according to the prediction results of all users until the graph attention network model converges, to obtain the social media robot recognition model.

Specifically, the prediction result of the user includes the probability that the user is predicted as each category label.

The cross entropy loss function is used as the loss function of the graph attention network model to update the model. According to the prediction results of all users, the loss function L is calculated according to the following formula:

$$L = -\sum_{i=1}^{N} \sum_{s=1}^{S} y_{i,s} \log p_{i,s}$$

where S is the number of category labels, N is the number of users, and $y_{i,s}$ indicates whether the manually labeled category label of user i is s: $y_{i,s}$ takes 1 when the manually labeled category label of user i is s and takes 0 otherwise. $p_{i,s}$ is the probability that user i is predicted to have category label s.

Steps S503-S508 are executed repeatedly: the loss function L is calculated and the parameters of the graph attention network model are updated until the model converges, at which point parameter updating stops and the social media robot recognition model is obtained.
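The cross entropy loss of step S508 can be illustrated with a short Python sketch. It assumes the loss is summed over users without dividing by N (dividing by N is a common variant), and it clamps probabilities with a small `eps` to avoid taking the logarithm of zero; the function name `cross_entropy` and the `eps` parameter are hypothetical.

```python
import math

def cross_entropy(labels, probs, eps=1e-12):
    """labels: annotated label per user.
    probs: per-user dict mapping label -> predicted probability.
    """
    loss = 0.0
    for y, p in zip(labels, probs):
        # y_{i,s} is 1 only for the annotated label, so a single term
        # per user survives the inner sum over category labels.
        loss -= math.log(max(p[y], eps))
    return loss
```

For example, a user labeled robot with predicted probability 0.5 contributes log 2 to the loss, and a perfectly confident correct prediction contributes 0.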

It can be understood that, in the technical scheme provided by this embodiment, the graph attention network model is established by combining the graph neural network and the multi-head attention layer, so that the model can both process data carrying a large amount of information and extract only the useful information from it. Because the model attends only to the feature data of adjacent users, the social relationships in the user data are fully used, and the trained social media robot recognition model has more diversified user information, which improves its recognition accuracy.

It can be understood that the technical scheme provided by this embodiment fully integrates the various features of social media users, especially their emotional features and social relationships, so that the dimensions covered by the feature data used for model training are more comprehensive, and normalizes the feature data with a reasonable data normalization method. The social media robot recognition model trained on this basis can judge the category label of the current user more accurately, which reduces the false recognition and missed recognition caused by ignoring the emotional features in the user data and improves the user experience.

The application also provides a social media robot identification method, which comprises the following steps:

Step S71, acquiring user data of the social media to be identified.

Step S72, extracting the characteristics of the user data, and normalizing the extracted characteristic data; the characteristic data at least comprises: profile features, linguistic features, emotional features, and timing features.

Step S73, inputting the feature data after the normalization processing into the social media robot recognition model obtained through the training to obtain a category label of the user data; the category label includes: robots, suspected robots, and non-robots.

The normalized feature data is input into the trained social media robot recognition model, which predicts the probabilities that the social media user to be identified belongs to each of the category labels robot, suspected robot and non-robot; the category label with the maximum probability is selected as the category of that user.
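The final selection in step S73 amounts to taking the label with the highest predicted probability. The Python sketch below assumes the model emits three raw output values per user and converts them to probabilities with a softmax, which is an assumption since the text only states that probabilities are predicted; the names `LABELS`, `softmax` and `predict_label` are hypothetical.

```python
import math

LABELS = ("robot", "suspected robot", "non-robot")

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(scores):
    """scores: the model's three output values for one user, ordered as LABELS."""
    probs = softmax(scores)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs
```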

It can be understood that, in the technical scheme provided by this embodiment, the social media robot recognition model of the above embodiments identifies the category label of the social media user to be identified. Because the emotional features and the timing features are additionally extracted from the user data, the dimensions covered by the feature data used for model training are more comprehensive.

Referring to fig. 6, fig. 6 is a block diagram illustrating an identification apparatus of a social media robot according to an exemplary embodiment, including:

the data acquisition module 61 is used for acquiring user data to be trained, wherein the user data is provided with manually labeled category labels; the category label includes: robots, suspected robots, and non-robots.

A feature extraction module 62, configured to perform feature extraction on the user data, and perform normalization processing on the extracted feature data; the characteristic data at least comprises: profile features, linguistic features, emotional features, and timing features.

A model training module 63, configured to input the feature data after the normalization processing into a pre-constructed graph attention network model for training until the graph attention network model converges, so as to obtain a social media robot recognition model.

And the social media robot identification model 64 is configured to receive the feature data after the normalization processing, calculate probabilities that the feature data after the normalization processing belong to different category tags, and use the category tag corresponding to the maximum probability value as the category tag of the current user.

It can be understood that, in the technical scheme provided by this embodiment, the graph attention network model of the above embodiments is trained to obtain the social media robot recognition model. Because the emotional features and the timing features are additionally extracted from the user data, the dimensions covered by the feature data used for model training are more comprehensive, so the model trained on this basis can determine the category label of the current user more accurately, which reduces the false recognition and missed recognition caused by ignoring the emotional features in the user data and improves the user experience.

The application also provides an identification device of a social media robot, comprising:

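a memory having an executable program stored thereon;

a processor for executing the executable program in the memory to implement the steps of any of the methods described above.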

Further, the present application provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the steps of any of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, the meaning of "plurality" means at least two unless otherwise specified.

It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or intervening elements may also be present; when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present, and further, as used herein, connected may include wirelessly connected; the term "and/or" is used to include any and all combinations of one or more of the associated listed items.

Any process or method descriptions in flow charts or otherwise described herein may be understood as: represents modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. A training method of a social media robot recognition model is characterized by comprising the following steps:

acquiring user data to be trained, wherein the user data is provided with manually labeled class labels;

the category label includes: robots, suspected robots, and non-robots;

2. The method of claim 1, wherein if the feature data comprises archival features, performing feature extraction on the user data comprises:

3. The method of claim 1, wherein performing feature extraction on the user data if the feature data includes a linguistic feature comprises:

for each part of speech, extracting the number and ratio of words of the part of speech in each tweet;

and respectively calculating the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the part of speech in all the tweets according to the number and the proportion of the words of the part of speech in all the tweets.

4. The method of claim 1, wherein if the feature data includes emotional features, performing feature extraction on the user data comprises:

for each emotion index, extracting to obtain the score of the emotion index in each tweet;

and respectively calculating the minimum value, the maximum value, the median value, the average value, the standard deviation, the skewness, the kurtosis and the entropy of the emotion index in all the tweets according to the scores of the emotion index in all the tweets.

5. The method of claim 1, wherein if the feature data comprises a time series feature, performing feature extraction on the user data comprises:

extracting at least one of the time of sending a tweet by the user, the time of forwarding a tweet by the user and the time of mentioning a tweet by the user in the user data;

respectively calculating the time interval of sending two tweets, forwarding the two tweets and mentioning the two tweets;

6. The method according to claim 1, wherein the normalizing the extracted feature data comprises:

and carrying out normalization processing on the extracted feature data according to the preset segment value.

7. The method according to claim 1, wherein the inputting the normalized feature data into a pre-constructed graph attention network model for training comprises:

constructing a graph attention network model of a G layer and initializing;

according to the input feature data of the user and the input feature data of the adjacent users, calculating to obtain an attention coefficient of each attention head corresponding to the user and the adjacent users;

calculating to obtain output characteristic data of each attention head of the layer of the user according to input characteristic data of adjacent users and corresponding attention coefficients;

8. A social media robot recognition method, comprising:

acquiring user data of social media to be identified;

inputting the normalized feature data into a social media robot recognition model obtained by training according to the method of any one of claims 1 to 7 to obtain a category label of the user data;

the category label includes: robots, suspected robots, and non-robots.

9. An apparatus for identifying social media robots, the apparatus comprising:

the data acquisition module is used for acquiring user data to be trained, and the user data is provided with manually marked category labels; the category label includes: robots, suspected robots, and non-robots;

10. An identification device of a social media robot, comprising:

a memory having an executable program stored thereon;

a processor for executing the executable program in the memory to implement the steps of the method of any one of claims 1-7.

11. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the steps of the method of any one of claims 1-7.