KR101804020B1

KR101804020B1 - Method for SNS bot detection using geographic information

Info

Publication number: KR101804020B1
Application number: KR1020150156970A
Authority: KR
Inventors: 신원용; 김동건; 조재희
Original assignee: 단국대학교 산학협력단
Priority date: 2015-11-09
Filing date: 2015-11-09
Publication date: 2017-12-28
Also published as: KR20170054167A

Abstract

Disclosed is a tweet bot detection method using spatial information. The method includes: determining whether an entropy value for inter-tweet time information is less than a threshold value for the time information; if so, determining whether an entropy value for inter-tweet distance information is less than a threshold value for the distance information; and if the entropy value for the inter-tweet distance information is less than the threshold value for the distance information, identifying the user who consecutively sent the tweets as an SNS bot.

Description

METHOD FOR SNS BOT DETECTION USING GEOGRAPHIC INFORMATION

TECHNICAL FIELD

The present invention relates to an SNS bot detection method using spatial information, and more particularly, to an SNS bot detection method that uses geo-tagged tweet data from a Twitter server to detect malicious tweet bots.

The term "tweet bot" is a compound of "Twitter," a social network service, and "bot," which is short for "robot." For example, the tweet bot @NDSL_kr, provided by the Korea Institute of Science and Technology Information (KISTI), replies with a web page address (URL) where related content can be viewed as soon as a user sends the desired data type and search term in a mention. The earthquake bot (@earthquakebot) gives real-time reports of earthquakes of magnitude 5.0 or greater around the world. The Seoul weather bot (@seoul_wt) reports the Seoul weather every hour, and @KBO reports scores every 10 minutes.

Tweet bots have positive aspects in that they provide information and entertainment, but they are also often used maliciously and cause side effects. In particular, because tweet bots are often anonymous accounts, a growing number of users are troubled by abusive language, profanity, and obscene content. To solve this problem, technologies for detecting tweet bots have been developed.

Although a growing number of users disclose their geospatial information (for example, through check-in services), conventional tweet bot detection techniques do not use this information at all. In addition, they do not use the smart device information provided in the source field of the data set. That is, with the conventional technology, a tweet bot cannot be detected unless time information and tweet text information are obtained.

SUMMARY OF THE INVENTION

It is an object of the present invention, devised to solve the above problems, to provide a tweet bot detection method using spatial information that uses geo-tagged tweets to compute the entropy values of two variables, the inter-tweet time and the inter-tweet distance, and detects tweet bots by comparing the temporal and spatial patterns of humans and tweet bots.

It is another object of the present invention to provide a tweet bot detection method using spatial information that detects tweet bots by using the entropy value of the inter-tweet distance variable together with a set of selected devices.

According to one aspect of the present invention, there is provided a method for detecting SNS bots using spatial information, comprising the steps of: constructing a data set comprising geo-tagged tweets; setting a threshold value for time information that enables a specified reliability, using the inter-tweet time information of tweets consecutively sent by the same user in the data set; setting a threshold value for distance information that enables the specified reliability, using the inter-tweet distance information of tweets consecutively sent by the same user in the data set; determining whether an entropy value for the inter-tweet time information is less than the threshold value for the time information; if the entropy value for the time information is less than the threshold value for the time information, determining whether an entropy value for the inter-tweet distance information is less than the threshold value for the distance information; and if the entropy value for the inter-tweet distance information is less than the threshold value for the distance information, determining that the user who consecutively sent the tweets is an SNS bot rather than an SNS user.

In constructing the data set, the data set is collected and configured through a streaming API.

In the step of constructing the data set, fields for the user ID, the latitude of the device location from which the tweet was sent, the longitude of the device location from which the tweet was sent, the time at which the tweet was sent, and the device that created the tweet are adopted from the metadata of each tweet.

In the step of setting the threshold value for the time information, the threshold value for the time information means an entropy value for the time information enabling the specified reliability.

In the step of setting the threshold value for the distance information, the threshold value for the distance information means an entropy value for the distance information that enables the specified reliability.

The step of determining the SNS bot may further include, when the user is determined to be an SNS bot, increasing an SNS bot count (b_count) to obtain an SNS bot detection probability (Bot DP).

The step of determining the SNS bot may further include, when an SNS user is mistaken for an SNS bot, increasing an SNS user count (h_count) to obtain a false alarm probability (FAP).

The step of determining whether the entropy value for the inter-tweet time information is less than the threshold value for the time information includes, if the entropy value is not less than the threshold value, determining that the user who consecutively sent the tweets is an SNS user.

The step of determining whether the entropy value for the inter-tweet distance information is less than the threshold value for the distance information includes, if the entropy value is not less than the threshold value, determining that the user who consecutively sent the tweets is an SNS user.

According to another aspect of the present invention, there is provided a method for detecting SNS bots using spatial information, comprising the steps of: constructing a data set comprising geo-tagged tweets; setting a threshold value for distance information that enables a specified reliability, using the inter-tweet distance information of tweets consecutively sent by the same user in the data set; selecting the devices that SNS users use to send tweets and setting them as an SNS user device set; determining whether an entropy value for the inter-tweet distance information is less than the threshold value for the distance information; if the entropy value for the inter-tweet distance information is less than the threshold value for the distance information, determining whether the device used to send the tweets belongs to the SNS user device set; and if the device does not belong to the SNS user device set, determining that the user who consecutively sent the tweets is an SNS bot rather than an SNS user.

The step of determining whether the device used to send the tweets belongs to the SNS user device set includes, if the device belongs to the SNS user device set, determining that the user who sent the tweets is an SNS user.

According to the present invention, geo-tagged tweets are used to compute the entropy values of two variables, the inter-tweet time and the inter-tweet distance, and the temporal and spatial patterns of humans and tweet bots are compared, so that tweet bots can be detected more precisely.

In addition, tweet bots can be detected more accurately by using the entropy value of the inter-tweet distance variable together with the selected device set, which is built from the smart device information of each user provided in the source field of the data set.

In addition, by constructing a spatial database (DB) for tweet bots, the invention can be used in the future to identify and detect the spatial patterns of malicious bots in various social network services.

FIG. 1 is a flowchart showing an embodiment of a tweet bot detection method using spatial information according to the present invention.
FIG. 2 is a flowchart showing another embodiment of a tweet bot detection method using spatial information according to the present invention.
FIG. 3 is a graph illustrating the tweet bot detection probability as a function of reliability when the tweet bot detection method using spatial information according to an embodiment of the present invention is used.
FIG. 4 is a graph showing the relationship between the bot detection probability (Bot DP) and the false alarm probability (FAP) when the tweet bot detection method using spatial information according to an embodiment of the present invention is used.
FIG. 5 is a graph illustrating the relationship between the tweet bot detection probability and the false alarm probability when the tweet bot detection method using spatial information according to another embodiment of the present invention is used.
FIG. 6 is a block diagram illustrating one embodiment of a smart device that performs methods in accordance with the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

The terms first, second, etc. may be used to describe various components, but the components should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or other elements may be present in between. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no other elements in between.

The terminology used in this application is used only to describe a specific embodiment and is not intended to limit the invention. The singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, the terms "comprises" or "having" and the like are used to specify that there is a feature, a number, a step, an operation, an element, a component or a combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with the meaning in the context of the relevant art and are not to be interpreted in an ideal or overly formal sense unless explicitly defined in the present application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In order to facilitate the understanding of the present invention, the same reference numerals are used for the same constituent elements in the drawings and redundant explanations for the same constituent elements are omitted.

FIG. 1 is a flowchart showing an embodiment of a tweet bot detection method using spatial information according to the present invention.

Referring to FIG. 1, the tweet bot detection method using spatial information according to an embodiment of the present invention is a two-step tweet bot detection method using time and distance information.

First, Twitter accounts are divided into humans and tweet bots; tweet bots are accounts that mass-produce tweets, for example to deliver news and announcements. However, there are also malicious tweet bots that spread spam or malicious information, so humans and tweet bots must be distinguished.

The present invention detects malicious tweet bots using geo-tagged tweet data from a Twitter server, and calculates the entropy, which indicates uncertainty, from the distance between tweets for each user.

In the present invention, a data set collected through the Twitter streaming API is used first. The data set consists of a large volume of geo-tagged tweets recorded from Twitter users in specific geographic areas (e.g., Seoul, London, Los Angeles, etc.).

Each tweet includes a number of elements distinguished by the name of the field to which they belong. For the detection technique of the present invention, the following five important fields are adopted from the metadata of the tweet.

user_id_str: String representation of the unique ID of a particular user

lat: The latitude of the device location that sent the tweet

lot: The longitude of the device location that sent the tweet

created_at: UTC / GMT time that tweet was sent

source: The smart device that created the Tweet
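As an illustration of the five fields listed above, the following minimal Python sketch models one geo-tagged tweet as a record. The class name `GeoTweet`, the helper `parse_tweet`, the flat input dictionary, and the ISO 8601 timestamp format are all assumptions made for this example; the text does not specify how the metadata is stored.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GeoTweet:
    """One geo-tagged tweet, reduced to the five adopted fields."""
    user_id_str: str      # unique user ID as a string
    lat: float            # latitude of the sending device
    lot: float            # longitude of the sending device (name as in the text)
    created_at: datetime  # UTC time at which the tweet was sent
    source: str           # device or client that created the tweet


def parse_tweet(raw: dict) -> GeoTweet:
    """Build a GeoTweet from a flat dict (hypothetical input layout).

    'created_at' is assumed to be an ISO 8601 string; naive timestamps
    are treated as UTC.
    """
    ts = datetime.fromisoformat(raw["created_at"])
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return GeoTweet(
        user_id_str=str(raw["user_id_str"]),
        lat=float(raw["lat"]),
        lot=float(raw["lot"]),
        created_at=ts.astimezone(timezone.utc),
        source=str(raw["source"]),
    )
```

Grouping such records by `user_id_str` and sorting them by `created_at` would give the consecutive-tweet sequences that the later steps operate on.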

User sets and tweet bot sets are classified using a ground-truth method. The content of the tweets posted on a Twitter page is analyzed, and a user who regularly and repeatedly tweets the same message or URL is classified as a tweet bot. In addition, if the analyzed tweet text contains spam content, the user is regarded as a tweet bot.

The human user set is divided into two: the first part is the training set, and the second part, together with the entire tweet bot set, forms the test set. Users are first classified into the user set and the tweet bot set as described above. However, users whose travel speed between tweets consecutively sent by the same user exceeds 300 km/h are reclassified as tweet bots. Under these classification criteria, the data set consists of 1,007 user accounts in total, of which 892 are human accounts and 115 are tweet bot accounts.
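The 300 km/h reclassification rule can be sketched as follows. This is only an illustration: the function name `exceeds_speed_limit` is hypothetical, the inputs are assumed to be per-user lists of consecutive-tweet distances (km) and time gaps (hours), and the handling of a zero time gap is an assumption that the text does not address.

```python
def exceeds_speed_limit(distances_km, gaps_hours, limit_kmh=300.0):
    """Return True if any consecutive-tweet pair implies a speed above limit_kmh.

    distances_km[k] and gaps_hours[k] describe the k-th pair of
    consecutive tweets of one user.
    """
    for d, t in zip(distances_km, gaps_hours):
        if t <= 0:
            # Assumption: a positive displacement in zero time counts
            # as exceeding the limit; zero displacement is ignored.
            if d > 0:
                return True
            continue
        if d / t > limit_kmh:
            return True
    return False
```

A user for whom this returns True would be moved from the user set to the tweet bot set before the training and test sets are formed.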

Tweet bots tend to post tweets at more regular time intervals than humans. Therefore, the entropy of the inter-tweet time information can be expected to be much smaller for tweet bots than for humans.

Given the location information of a user's tweet and the location information of that user's next tweet, the geographical distance between the two points, $D_{ij}$, can be obtained by using the spherical law of cosines. The distance $D_{ij}$ is limited to a maximum of 800 km; $D_{ij} = 0$ is separated out first, and the remaining range is divided into intervals as shown in Table 1 below, giving 101 intervals in total.
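A minimal sketch of the distance computation with the spherical law of cosines is given below. The function name `distance_km`, the mean Earth radius of 6371 km, and the choice to clip (rather than discard) distances above 800 km are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def distance_km(lat1, lon1, lat2, lon2, cap_km=800.0):
    """Great-circle distance in km via the spherical law of cosines.

    Inputs are in degrees. The result is clipped to cap_km, which is one
    possible reading of "limited to a maximum of 800 km".
    """
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    c = (math.sin(p1) * math.sin(p2)
         + math.cos(p1) * math.cos(p2) * math.cos(dlon))
    c = max(-1.0, min(1.0, c))  # guard against rounding just outside [-1, 1]
    return min(EARTH_RADIUS_KM * math.acos(c), cap_km)
```

The law of cosines loses precision for points that are very close together, so identical coordinates may yield a tiny non-zero value; a real implementation might treat distances below some small tolerance as zero.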

The time between two consecutive tweets is handled in the same way. The time is limited to a maximum of 144 hours; a value of 0 is separated out first, and the remaining range is divided in 1-hour units, giving 145 intervals in total.
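The 145-interval time binning can be sketched as below. The function name `time_bin` and the boundary convention (a gap of exactly one hour falls in the first non-zero interval) are assumptions; the text only gives the 144-hour limit, the separate zero interval, and the 1-hour unit.

```python
import math


def time_bin(gap_hours, cap_hours=144.0):
    """Map an inter-tweet time gap (hours) to a bin index in 0..144.

    Bin 0 holds gaps of exactly zero; bins 1..144 are 1-hour intervals,
    with gaps above cap_hours clipped into the last bin (assumption).
    """
    if gap_hours <= 0:
        return 0
    return math.ceil(min(gap_hours, cap_hours))
```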

Table 1 below shows how the geographic distance between two points ($D_{ij}$), when a user sends tweets from different locations, is divided into intervals.

Table 1
D_ij (km)            Range (km)    Number of intervals
0 < D_ij < 0.1       0             1
0.1 ≤ D_ij < 1       0.1           9
1 ≤ D_ij < 3         0.5           4
3 ≤ D_ij < 10        1             7
10 ≤ D_ij < 800      10            79
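The interval scheme of Table 1 can be turned into a bin lookup as sketched below. The names `build_distance_edges`, `EDGES`, and `distance_bin` are hypothetical, and placing a distance of exactly 800 km in the last interval is an assumption.

```python
from bisect import bisect_right


def build_distance_edges():
    """Left edges (km) of the 100 non-zero intervals of Table 1."""
    edges = [0.0]                                        # 0 < D < 0.1: 1 interval
    edges += [round(0.1 * k, 1) for k in range(1, 10)]   # 0.1 .. 0.9: 9 intervals
    edges += [1.0 + 0.5 * k for k in range(4)]           # 1.0 .. 2.5: 4 intervals
    edges += [float(k) for k in range(3, 10)]            # 3 .. 9: 7 intervals
    edges += [float(k) for k in range(10, 800, 10)]      # 10 .. 790: 79 intervals
    return edges


EDGES = build_distance_edges()


def distance_bin(d_km):
    """Bin index in 0..100: 0 for D == 0, 1..100 for positive distances."""
    if d_km <= 0:
        return 0
    # Number of left edges not exceeding d_km gives the 1-based interval.
    return bisect_right(EDGES, d_km)
```

Together with the separate zero interval this yields 101 bins, matching the count stated in the text.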

For both the training set and the test set, a set of entropy values is defined for the inter-tweet time information and for the inter-tweet distance information. In the training set, each element is the entropy value of the i-th user's inter-tweet time information or inter-tweet distance information; the test set is defined likewise.

The total number of users is 446 in the training set and 561 in the test set.

The following is the basic formula for the entropy, which represents uncertainty. Using the records in the training set, the entropy values for the inter-tweet time information and the inter-tweet distance information of the i-th user are calculated by Equation (1):

$$H(T) = -\sum_{i=1}^{n} P(t_i)\,\log P(t_i), \qquad H(D) = -\sum_{i=1}^{n} P(d_i)\,\log P(d_i) \qquad (1)$$

Here, $P(t_i)$ and $P(d_i)$ are the probability distributions of the variables $T$ (inter-tweet time) and $D$ (inter-tweet distance) for the i-th index, and $n$ represents the total number of data.
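A per-user entropy value of this kind can be estimated from the binned inter-tweet times or distances, for example as in the sketch below. The function name `entropy` and the logarithm base 2 are assumptions; the text does not state which base is used, and changing it only rescales the values.

```python
import math
from collections import Counter


def entropy(bin_indices, base=2.0):
    """Shannon entropy of the empirical distribution over bin indices.

    bin_indices is the list of interval indices observed for one user
    (one entry per pair of consecutive tweets).
    """
    n = len(bin_indices)
    if n == 0:
        return 0.0
    counts = Counter(bin_indices)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())
```

A user whose gaps always fall into the same interval gets an entropy of zero, which is the regular behaviour the method associates with bots.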

The method of detecting the tweet bot using the entropy value for each user is as follows.

First, the threshold value for each reliability level is obtained from the entropy values of the training set. Here, the reliability is the probability of being included in a range specified by the user, and the threshold is the entropy value that achieves the specified reliability.

The higher the reliability, the more humans fall within the specified range. For example, the threshold that makes the reliability 80% is the minimum entropy value in the training set such that 80% of the humans are included when counting down from the maximum entropy value. The reliability is specified from the maximum value because tweet bots tend to tweet more periodically than humans.
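The threshold selection can be sketched as a simple order statistic over the training-set entropies of human users. The function name `entropy_threshold` and the rounding rule for a non-integer user count are assumptions made for this illustration.

```python
import math


def entropy_threshold(human_entropies, reliability):
    """Smallest entropy such that a `reliability` share of humans,
    counted down from the maximum entropy, is included.

    reliability is a fraction in (0, 1], e.g. 0.8 for 80%.
    """
    if not human_entropies:
        raise ValueError("human_entropies must not be empty")
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    ordered = sorted(human_entropies, reverse=True)
    # round() guards against float noise before taking the ceiling.
    k = math.ceil(round(reliability * len(ordered), 9))
    return ordered[k - 1]
```

With ten humans whose entropies are 1 through 10 and a reliability of 0.8, the threshold is 3: the eight humans with entropy 3 or more are retained, and the two below the threshold would be flagged.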

In the testing process, tweet bots are detected in two stages based on the previously set threshold values. In the first-stage detection process, only the entropy for the time information is used: a user whose entropy value for the time information is smaller than the threshold value is classified as a tweet bot.

Next, in the second-stage detection process, detection is performed according to the entropy of the distance information proposed in the present invention. The smaller the variation in the inter-tweet distance information, the higher the likelihood that the user is a tweet bot. Therefore, an entropy threshold value is specified, and a user whose entropy is smaller than this value is identified as a tweet bot.

That is, a minimum entropy value for the distance information is designated for the test set, and a user having an entropy value smaller than this value is detected as a tweet bot.

First, a terminal that detects tweet bots using geo-tagged tweet data from a Twitter server constructs a data set of large-scale geo-tagged tweets recorded from Twitter users, including SNS users and SNS tweet bots (S10).

The terminal sets a threshold value, which is the entropy value for time information that enables the specified reliability, using the inter-tweet time information of tweets consecutively sent by the same user in the constructed data set (S11).

The terminal sets a threshold value, which is the entropy value for distance information that enables the specified reliability, using the inter-tweet distance information of tweets consecutively sent by the same user in the constructed data set (S12).

The terminal determines whether the entropy value of the inter-tweet time information of a user who consecutively sends tweets is smaller than the threshold value for the inter-tweet time information (S13).

For a user whose entropy value for the time information is smaller than the threshold value for the time information, the terminal determines whether the entropy value of the inter-tweet distance information is smaller than the threshold value for the inter-tweet distance information (S14).

If the entropy value for the distance information is smaller than the threshold value for the distance information, the terminal determines that the user is an SNS tweet bot (S15).

However, if it is determined in step S13 that the entropy value of the inter-tweet time information is not smaller than the threshold value for the inter-tweet time information, or if it is determined in step S14 that the entropy value of the inter-tweet distance information is not smaller than the threshold value for the inter-tweet distance information, the terminal determines that the user is an SNS user (S16).
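The decision logic of steps S13 to S16 can be sketched as a small function. The function name `classify_two_stage` and the string labels are illustrative choices, not part of the described method.

```python
def classify_two_stage(h_time, h_dist, thr_time, thr_dist):
    """Two-stage decision using time entropy, then distance entropy.

    h_time, h_dist: entropy values of one user's inter-tweet time and
    inter-tweet distance; thr_time, thr_dist: the corresponding thresholds.
    """
    if h_time >= thr_time:   # S13 fails -> S16
        return "human"
    if h_dist >= thr_dist:   # S14 fails -> S16
        return "human"
    return "bot"             # both entropies below threshold -> S15
```

Because both conditions must hold, the second stage can only turn a first-stage bot decision back into a human decision, never the reverse.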

When the user is determined to be an SNS tweet bot, the terminal increases the tweet bot count (b_count) to obtain the tweet bot detection probability (S17). When an SNS user is mistaken for a tweet bot, the terminal increases the user count (h_count) to obtain the false alarm probability (S18).
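The two counters can be turned into rates as sketched below. The formulas used here are assumptions, since the original expressions are not recoverable from the text: the detection probability is taken as b_count divided by the number of true bots, and the false alarm probability as h_count divided by the number of true humans. The function name `detection_rates` and the label strings are also hypothetical.

```python
def detection_rates(predictions, truths):
    """Return (bot_dp, fap) from predicted and ground-truth labels.

    predictions and truths are equal-length lists of "bot" / "human".
    b_count counts bots detected as bots; h_count counts humans
    mistaken for bots (assumed definitions).
    """
    b_count = 0
    h_count = 0
    for pred, truth in zip(predictions, truths):
        if pred == "bot" and truth == "bot":
            b_count += 1
        elif pred == "bot" and truth == "human":
            h_count += 1
    n_bots = sum(1 for t in truths if t == "bot")
    n_humans = sum(1 for t in truths if t == "human")
    bot_dp = b_count / n_bots if n_bots else 0.0
    fap = h_count / n_humans if n_humans else 0.0
    return bot_dp, fap
```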

That is, the terminal determines that a user is a tweet bot when the entropy value of the inter-tweet time information is smaller than the threshold value for the inter-tweet time information and the entropy value of the inter-tweet distance information is smaller than the threshold value for the inter-tweet distance information.

FIG. 2 is a flowchart showing another embodiment of a tweet bot detection method using spatial information according to the present invention.

Referring to FIG. 2, the tweet bot detection method using spatial information according to another embodiment of the present invention is a tweet bot detection method using inter-tweet distance information and a user device set.

Tweet bots tend to have inter-tweet distances close to zero, or to move relatively regularly over a larger scale than humans. Therefore, the entropy of the inter-tweet distance can be expected to be much smaller for tweet bots than for humans.

For both the training set and the test set, a set of entropy values is defined for the inter-tweet distance information. In the training set, each element is the entropy value of the i-th user's inter-tweet distance information; the test set is defined likewise. Using the records in the training set, the entropy of the distance information of the i-th user is calculated by Equation (2):

$$H(D) = -\sum_{i=1}^{n} P(d_i)\,\log P(d_i) \qquad (2)$$

Here, $P(d_i)$ is the probability distribution of the distance information $D$ for the i-th index, and $n$ represents the total number of data.

First, a terminal that detects tweet bots using geo-tagged tweet data from a Twitter server constructs a data set of large-scale geo-tagged tweets recorded from Twitter users, including SNS users and tweet bots (S20).

The terminal sets a threshold value, which is the entropy value for distance information that enables the specified reliability, using the inter-tweet distance information of tweets consecutively sent by the same user in the constructed data set (S21).

The terminal selects the devices that SNS users use to send tweets and sets them as the SNS user device set (S22). The device set DV of the selected SNS users is defined as follows. The device set DV includes, for example, iPhone, iPad, Windows, Android phone, and clients of social network services (SNS) such as Twitter, Foursquare, and Instagram.

The terminal selects only the devices having a probability value of 0.5% or more in the distribution of devices used by SNS users to send tweets, and regards the selected devices as the devices used by SNS users.
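The device set selection can be sketched as a frequency filter over the source field of tweets from human users. The function name `select_device_set` is hypothetical, and counting one entry per tweet (rather than per user) is an assumption.

```python
from collections import Counter


def select_device_set(human_sources, min_share=0.005):
    """Return the set of sources whose share among human tweets is
    at least min_share (0.5% by default)."""
    counts = Counter(human_sources)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {src for src, c in counts.items() if c / total >= min_share}
```

Sources that appear only rarely among human users, such as custom automation clients, fall below the 0.5% share and stay outside the set DV.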

The terminal determines whether the entropy value of the inter-tweet distance information of a user who consecutively sends tweets is smaller than the threshold value for the inter-tweet distance information (S23).

If the entropy value for the distance information is smaller than the threshold value for the distance information, the terminal determines whether the device of the user who consecutively sent the tweets belongs to the previously selected user device set DV (S24).

If the device of the user who consecutively sent the tweets does not belong to the previously selected user device set DV, the terminal determines that the user is a tweet bot (S25).

However, if it is determined in step S23 that the entropy value of the inter-tweet distance information is not smaller than the threshold value for the inter-tweet distance information, or if it is determined in step S24 that the device of the user who consecutively sent the tweets belongs to the user device set DV, the terminal determines that the user is an SNS user (S26).

When the user is determined to be a tweet bot, the terminal increases the tweet bot count (b_count) to obtain the tweet bot detection probability (S27). When an SNS user is mistaken for a tweet bot, the terminal increases the user count (h_count) to obtain the false alarm probability (S28).

That is, the terminal determines that a user is a tweet bot when the entropy value of the inter-tweet distance information is smaller than the threshold value for the inter-tweet distance information and the device of the user who consecutively sent the tweets does not belong to the selected device set DV.
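The decision logic of steps S23 to S26 can be sketched as follows. The function name `classify_by_distance_and_device` is hypothetical, and representing a user by a single source value is a simplifying assumption; a user with several sources would need an additional rule that the text does not give.

```python
def classify_by_distance_and_device(h_dist, thr_dist, source, device_set):
    """Decision using distance entropy, then device set membership.

    h_dist: entropy of the user's inter-tweet distances;
    thr_dist: distance-entropy threshold;
    source: the device/client string of the user's tweets;
    device_set: the selected SNS user device set DV.
    """
    if h_dist >= thr_dist:      # S23 fails -> S26
        return "human"
    if source in device_set:    # S24: device belongs to DV -> S26
        return "human"
    return "bot"                # S25
```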

FIG. 3 is a graph illustrating the tweet bot detection probability as a function of reliability when the tweet bot detection method using spatial information according to an embodiment of the present invention is used.

Referring to FIG. 3, the higher the reliability, the higher the probability that a human is recognized as a human, so tweet bots are detected stably. Conversely, as the reliability decreases, detection becomes less stable, but the tweet bot detection probability increases accordingly.

In addition, the present invention shows a tweet bot detection probability about 10 to 15% higher than the existing technology over the entire reliability range; that is, the probability of detecting tweet bots is higher at the same reliability.

FIG. 4 is a graph showing the relationship between the bot detection probability (Bot DP) and the false alarm probability (FAP) when the tweet bot detection method using spatial information according to an embodiment of the present invention is used.

Referring to FIG. 4, in the present invention, in which a user having an entropy value smaller than a specific threshold value is identified as a tweet bot, the tweet bot detection probability increases as the false alarm probability increases. Compared to the conventional method, the present invention shows a higher tweet bot detection probability at the same false alarm probability.
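The trade-off between false alarm probability and detection probability can be traced by sweeping the entropy threshold, as in the sketch below. This is a single-threshold illustration rather than the full two-stage method; the function name `sweep_thresholds` and the rate definitions (share of humans and share of bots falling below the threshold) are assumptions.

```python
def sweep_thresholds(human_entropies, bot_entropies, thresholds):
    """Return a list of (threshold, fap, bot_dp) tuples.

    fap: share of humans with entropy below the threshold;
    bot_dp: share of bots with entropy below the threshold.
    Both input lists must be non-empty.
    """
    points = []
    for thr in thresholds:
        fap = sum(1 for h in human_entropies if h < thr) / len(human_entropies)
        dp = sum(1 for b in bot_entropies if b < thr) / len(bot_entropies)
        points.append((thr, fap, dp))
    return points
```

Raising the threshold can only add users to the flagged group, so both rates are non-decreasing in the threshold, which is consistent with the tendency described for FIG. 4.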

FIG. 5 is a graph illustrating the relationship between the tweet bot detection probability and the false alarm probability when the tweet bot detection method using spatial information according to another embodiment of the present invention is used.

Referring to FIG. 5, as with the tweet bot detection method using spatial information according to the first embodiment, this embodiment of the present invention shows a higher tweet bot detection probability than the conventional technology at the same false alarm probability.

FIG. 6 is a block diagram illustrating one embodiment of a smart device that performs methods in accordance with the present invention.

Referring to FIG. 6, the smart device 100 of the present invention may include at least one processor 110, a memory 120, and a network interface device 130 for communicating with a network. The smart device 100 may further include an input interface device 140, an output interface device 150, a storage device 160, and the like. Each component included in the smart device 100 may be connected by a bus 170 and communicate with each other.

The processor 110 may execute a program command stored in the memory 120 and / or the storage device 160. The processor 110 may refer to a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods of the present invention are performed. The memory 120 and the storage device 160 may be composed of a volatile storage medium and / or a non-volatile storage medium. For example, memory 120 may be comprised of read only memory (ROM) and / or random access memory (RAM).

The smart device 100 according to the present invention having the above-described configuration can detect tweet bots using geo-tagged tweet data from a Twitter server (not shown) by performing the methods described with reference to FIGS. 1 and 2.

The methods according to the present invention can be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer readable medium. The computer readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer readable medium may be those specially designed and constructed for the present invention or may be available to those skilled in the computer software.

Examples of computer readable media include hardware devices that are specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and the like. Examples of program instructions include machine language code such as those generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate with at least one software module to perform the operations of the present invention, and vice versa.

It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

100: Smart device
110: Processor
120: Memory
130: Network interface device
140: Input interface device
150: Output interface device
160: Storage device
170: Bus

Claims

A method for detecting an SNS bot performed by a terminal, the method comprising:
Constructing a data set comprising large-scale geo-tagged messages;
Setting a threshold value for time information that enables a specified reliability, using time information between messages consecutively posted by the same user in the data set;
Setting a threshold value for distance information that enables the specified reliability, using distance information between the transmission positions of messages consecutively posted by the same user in the data set;
Determining whether an entropy value for the message-to-message time information is less than the threshold value for the time information;
Determining whether an entropy value for the distance information between transmission positions is less than the threshold value for the distance information when the entropy value for the message-to-message time information is less than the threshold value for the time information; and
Determining that the user who consecutively posted the messages is the SNS bot if the entropy value for the distance information between the transmission positions is less than the threshold value for the distance information.

The method according to claim 1,
In constructing the data set,
Wherein the data set is collected and configured through a streaming API.

The method according to claim 1,
In constructing the data set,
Wherein fields for a user ID, a latitude of the device location from which the message was transmitted, a longitude of the device location from which the message was transmitted, and a device generating the message are adopted from the metadata of each message.

The method according to claim 1,
In setting the threshold value for the time information,
Wherein the threshold value for the time information means an entropy value for time information enabling the specified reliability.

The method according to claim 1,
In the step of setting the threshold value for the distance information,
Wherein the threshold value for the distance information means an entropy value for distance information that enables the specified reliability.

The method according to claim 1,
In the step of determining the SNS bot,
Further comprising increasing an SNS bot count (b_count) to obtain an SNS bot detection probability (Bot DP) when the user is determined to be the SNS bot.

The method according to claim 1,
In the step of determining the SNS bot,
Further comprising increasing an SNS user count (h_count) to obtain a false alarm probability (FAP) when an SNS user is mistaken for the SNS bot.

The method according to claim 1,
Determining whether an entropy value for the message-to-message time information is less than a threshold value for the time information,
If the entropy value of the message-to-message time information is not smaller than the threshold value for the time information, determining that the user who has posted the message continuously is an SNS user.

The method according to claim 1,
Determining whether an entropy value of the distance information between the transmission positions is smaller than a threshold value for the distance information,
If the entropy value of the distance information between the transmission positions is not smaller than the threshold value for the distance information, determining that the user who sent the message continuously is the SNS user.

A method for detecting an SNS bot performed by a terminal, the method comprising:
Constructing a data set comprising large-scale geo-tagged messages;
Setting a threshold value for distance information that enables a specified reliability, using distance information between the transmission positions of messages consecutively posted by the same user in the data set;
Selecting the devices that SNS users use to send messages and setting them as a set of SNS user devices;
Determining whether an entropy value for the distance information between the transmission positions is less than the threshold value for the distance information;
Determining whether the device used for message posting belongs to the set of SNS user devices when the entropy value for the distance information between the transmission positions is less than the threshold value for the distance information; and
Determining that the user who consecutively posted the messages is the SNS bot if the device used for message posting does not belong to the set of SNS user devices.

The method of claim 10,
In constructing the data set,
Wherein the data set is collected and configured through a streaming API.

The method of claim 10,
In constructing the data set,
Wherein fields for a user ID, a latitude of the device location from which the message was transmitted, a longitude of the device location from which the message was transmitted, and a device generating the message are adopted from the metadata of each message.

The method of claim 10,
In the step of setting the threshold value for the distance information,
Wherein the threshold value for the distance information means an entropy value for distance information that enables the specified reliability.

The method of claim 10,
In the step of determining the SNS bot,
Further comprising increasing an SNS bot count (b_count) to obtain an SNS bot detection probability (Bot DP) when the user is determined to be the SNS bot.

The method of claim 10,
In the step of determining the SNS bot,
Further comprising increasing an SNS user count (h_count) to obtain a false alarm probability (FAP) when an SNS user is mistaken for the SNS bot.

The method of claim 10,
Determining whether an entropy value of the distance information between the transmission positions is smaller than a threshold value for the distance information,
If the entropy value of the distance information between the transmission positions is not smaller than the threshold value for the distance information, determining that the user who has posted the message continuously is the SNS user.

The method of claim 10,
In determining whether the device used for message posting belongs to the SNS user device set,
If the device used for message posting belongs to the set of SNS user devices, determining that the user who has posted the message continuously is the SNS user.