CN103853841A

CN103853841A - Method for analyzing abnormal behavior of user in social networking site

Info

Publication number: CN103853841A
Application number: CN201410101728.3A
Authority: CN
Inventors: 闫丹凤; 吴海莉; 徐佳
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2014-03-19
Filing date: 2014-03-19
Publication date: 2014-06-11

Abstract

The invention discloses a method for analyzing abnormal user behavior on social networking sites. The method can detect abnormal events such as advertising through stolen accounts, posting malicious links, flooding (mass posting of junk content), and defrauding social network friends. The method comprises the following steps: acquiring user behavior data with web crawler technology; analyzing and detecting the data with user behavior analysis technology; and raising an alarm when an abnormality is detected. Three functional units carry out the method: a data acquisition unit, an analysis and detection unit, and an abnormality alarm unit. The data acquisition unit acquires the user behavior data with web crawler technology; the analysis and detection unit analyzes the acquired data with user behavior analysis technology; the abnormality alarm unit sends an alarm message when an abnormality is detected. The method detects the abnormal events that are widespread on social networking sites conveniently, flexibly, and intelligently; a social networking site provider can use it to find malicious users promptly and so reduce the losses of Internet users.

Description

A method for analyzing abnormal behavior of social network users

Technical field

The present invention relates to a method for analyzing abnormal behavior of social network users. It is used to detect abnormal user behavior on social networking sites, such as posting malicious links, spam advertisements, and fraudulent messages, and belongs to the technical field of network security detection.

Background art

CNNIC statistics show that in 2013 the number of microblog users in China reached 536 million, and the number of Renren users exceeded 280 million. The mass of users, the indispensable entity of a social network, drives the development of commerce and of human social interaction. As online social networking flourishes, information of all kinds is continually exchanged and spread through social activity. Because this information may contain not only users' private data but also the trade secrets of companies, its value is increasingly recognized. With the rapid rise of social applications such as microblogs and Renren, security problems rooted in social networks have become more and more prominent; for example, the number of phishing frauds carried out through social networks has risen sharply in recent years.

The trust and acceptance inherent in friend relationships on social networks are the starting point for criminals' malicious activity, and are also the root of social network security problems. By stealing user accounts, criminals steal user information, lure users into clicking advertisements, and commit money-borrowing fraud and other illegal acts. Reports from many security companies in recent years show that about one quarter of phishing activities such as money-borrowing fraud and fake prize draws spread through social networks. These companies also predict that comprehensively improving social network security will become a new challenge for network security.

Summary of the invention

In view of this, the present invention targets abnormal events in which a normal social network account is stolen and then used to post malicious messages such as fraud, phishing, and spam, and proposes a method for detecting such events. The method crawls user behavior data with web crawler technology, performs behavior modeling and analysis based on user behavior analysis and mathematical modeling, and sends an SMS alarm when an abnormal account is detected. It can supply a social network provider with a list of abnormal users, thereby greatly reducing the harm that online fraud, phishing, and spam do to Internet users. As a part of Web security detection, the method also has reference value and guiding significance for research on security problems in the Web environment.

The proposed method obtains the messages that users post on a social network by means of web crawler and Web parsing technology, performs user behavior analysis on these data to detect abnormal users, and raises an alarm. The method can detect abnormal events on a target social networking site (Renren, microblogs, etc.), including advertising through stolen accounts, posting malicious links, flooding, and defrauding social network friends of money. The invention consists of three main functional units: a data acquisition unit, an analysis and detection unit, and an abnormality alarm unit.

The functional characteristics of the data acquisition unit are as follows:

The unit obtains permission to operate on the target social network and crawls user message data (posted statuses, blog entries, photos, shares, comments, and so on) with web crawler technology. The crawled data are parsed, classified by user, and stored in files, which serve as the input to the analysis and detection unit.

The unit comprises four subunits: user login, data crawling, data parsing, and data output.

The functional characteristics of the user login subunit are as follows:

A singleton Connector class is created that uses DefaultHttpClient, HttpGet, and HttpPost. HttpGet fetches the Renren entry URL; HttpPost carries the Renren login URL together with the basic information of the login user (user name, password, Renren domain name, and so on; these parameters are read from the configuration unit). The login() method is then executed. If the post-login page is reached, the login has succeeded, and the user's credential information is saved as a Cookie for use in subsequent crawls.

The functional characteristics of the data crawling subunit are as follows:

The subunit implements the ICrawler and IParser interfaces, where IParser extends HtmlParser. It mainly comprises the CrawlFeeds, CrawlTimelineFeed, FilterOpenUser, and FeedController classes. Strictly speaking, FeedController does not belong to the data crawling subunit, because it controls both data crawling and data output and storage. After login, FilterOpenUser starts from the logged-in user node and collects all URLs related to each user to be crawled. If a user to be crawled is a friend of the logged-in user, that user can be crawled directly; if not, some information can be viewed only after the user has been added as a friend. In this way the list of all viewable userIds is obtained. FeedController then takes the userId list produced by FilterOpenUser as input and calls CrawlFeeds or CrawlTimelineFeed to crawl. Crawling is incremental and timer-driven: the timer triggers a crawl at a fixed interval, which is set in the configuration unit. Data are crawled separately for each userId.

The functional characteristics of the data parsing subunit are as follows:

The subunit parses the crawled pages. All data gathered by the crawling subunit are classified by userId and then by type (status, blog entry, shared link, and so on), and the publication time, content, and other details of each item are extracted; that is, the HTML text is parsed into the actual content of the message. The subunit consists mainly of the FeedFilter class and the HtmlParser class. HtmlParser is a mature HTML parsing library written in Java. It does not depend on other Java libraries, is used mainly to transform and extract HTML, and parses HTML quickly and accurately. The subunit uses HtmlParser to extract the text content of messages. HtmlParser represents HTML through Node, AbstractNode, and Tag. By defining a NodeFilter object in the program to filter the HTML tags that hold the text, the message text can be located easily.

The functional characteristics of the data output subunit are as follows:

The data obtained by the crawler are written to files named after the userId. Each record stored in a file has the format: data ID, data type, content, content language, publication time.

The functional characteristics of the analysis and detection unit are as follows:

The unit takes the output of the data acquisition unit as input and preprocesses it. The detection method defines seven user behavior features, each modeled separately. All of a user's historical data are modeled with these seven feature models to obtain the user's behavior profile. Messages posted after the last time point of the historical data are first analyzed against the seven behavior features; an anomaly score is obtained for each feature, and the seven scores are combined into a total anomaly score that determines whether the user is abnormal.

The detection method covers four aspects: user behavior modeling, similarity analysis of user messages, calculation of message anomaly scores, and the final detection of abnormal events.

The functional characteristics of user behavior modeling are as follows:

A user behavior profile is derived from the user's historical behavior on the social network and describes the behavior expected of that user in the future. Building the profile, that is, modeling user behavior, requires the stream of messages the user has posted on the social networking site, and this message stream is exactly what the data acquisition unit produces. The output of the data acquisition unit can therefore be used to build the behavior profile.

Based on the characteristics of social networks and the needs of detection, the unit defines seven features for each message and trains one statistical model per feature. Each model captures one aspect of a message. Once all of a user's messages have been analyzed, the user's characteristic values in these seven aspects are known, and the kind of message the user is expected to send can be predicted. The seven feature models are described in detail below.

1. Time of day (hour) at which a message is sent. This model captures the hours of the day during which an account is active. Many users are inactive during certain periods of the day, for example at mealtimes or while asleep. The times at which the user posted the messages in the stream determine the inactive periods, and a message posted during an inactive period is considered abnormal.

2. Message source: the application used to post the message. Most social networking sites offer users both traditional web access and mobile web access, as well as applications for mobile platforms such as iOS and Android. Many social networks also offer a variety of applications created independently by third-party developers. By default, a third-party application cannot post messages to a user's account. If a user chooses to do so, however, the user can grant this privilege to the application, which lets the application access the user's personal information without the user's credentials. Evaluations show that third-party applications are frequently used to send malicious messages.

This model determines whether the user has used a particular application before or, conversely, whether this is the first time a message has been sent through that application. When a user posts a message with a new application, the change may indicate that an attacker has successfully lured the victim into authorizing a malicious application to access the account.

3. Message text (language). Users are free to post in any language, but in practice each user posts in only a few languages (usually one or two). Therefore, especially when this feature (message language) has been stable, a sudden change of language indicates suspicious behavior.

The language of a message is determined with the libtextcat library, an open-source library that implements an n-gram-based text categorization algorithm.

4. Message topic. The messages users post often contain chatter or mundane information. Many users, however, have a set of topics they talk about frequently, such as a favorite sports team, band, or television program. If a user's messages normally concentrate on a few topics and the user suddenly posts on a different, unrelated topic, the new message should be rated as abnormal.

In general, inferring the topic of a message from a short text fragment without context is difficult. Social network platforms, however, let users tag messages to state explicitly which topic a message is about. Where such tags are present, they are a valuable source of information. A well-known example of a message tagging mechanism is the topic tag of Renren and microblogs, usually written as "#topic#", with the topic placed between two "#" characters.
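The topic tag convention can be illustrated with a short sketch. It is only an illustration, not part of the described implementation: the function name `extract_topics` and the regular expression are assumptions, and the sketch handles only topics wrapped in a pair of "#" characters.

```python
import re

# Hypothetical pattern: a topic is the text between two '#' characters.
# The first character of the topic must not be whitespace or '#'.
TOPIC_PATTERN = re.compile(r"#([^#\s][^#]*?)#")


def extract_topics(text):
    """Return the topics tagged as #topic# in a message, in order of appearance."""
    return [topic.strip() for topic in TOPIC_PATTERN.findall(text)]


if __name__ == "__main__":
    print(extract_topics("Watching #World Cup# tonight, then #NBA# tomorrow"))
```

A message without a paired tag yields an empty list, which corresponds to the case where the topic feature has no value.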

5. Links in messages. Messages posted on social networking sites commonly contain links to other resources, such as blogs, pictures, videos, or news articles. Links have been widespread in messages ever since social networks appeared, so earlier security research on social networks concentrated largely on analyzing URLs and treated the URL as the sole factor for deciding whether a message is malicious. This method also uses the URLs in messages as part of the user behavior profile, but only as one feature model among several. Moreover, the behavior feature models are built mainly to capture a user's normal activity. In other words, the method does not try to detect whether a URL itself is malicious, but whether the user would normally send such a URL.

To characterize the links in a message, the method uses only the domain name of each URL. The reason is that users often refer to content under the same domain name. For example, many users regularly read particular news sites and blogs and often link to interesting articles there. Malicious links, on the other hand, point to illegitimate websites. A link whose domain name has not appeared before therefore represents a change in behavior. The model also takes into account how frequently messages contain links and how consistently the user links to particular websites.
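Reducing each link to its domain name can be sketched as follows. This is a minimal illustration under stated assumptions: the names `link_domains` and `unseen_domains` and the URL pattern are hypothetical, and only http and https links are recognized.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: an http(s) URL runs until the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def link_domains(text):
    """Return the host names of all http(s) links found in a message."""
    domains = []
    for url in URL_PATTERN.findall(text):
        host = urlparse(url).hostname  # lower-cased by urlparse
        if host:
            domains.append(host)
    return domains


def unseen_domains(text, known_domains):
    """Return the link domains of a message that the profile has not seen before."""
    return [d for d in link_domains(text) if d not in known_domains]


if __name__ == "__main__":
    msg = "see http://News.Example.com/a?id=1 and https://blog.example.org/x"
    print(link_domains(msg))
    print(unseen_domains(msg, {"news.example.com"}))
```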

6. Interaction between users. Social networks provide mechanisms for direct interaction between individual users. The most common is sending a message directly to a recipient; different social networks offer different mechanisms. Over time, a user builds up a history of interactions with other users on the social network. This feature model captures that interaction history; in effect, it tracks all users with whom the account has interacted. Because a direct message is meant to attract the recipient's attention, this kind of direct interaction is frequently used to send spam.

7. Geographic proximity. In many cases a user's friends on a social network are people who are also close to the user in real life. For example, a Renren user will have many friends who live in the same city, attend the same school, or work at the same company. If the user suddenly starts to interact with people living on another continent, this may be suspicious. This feature captures whether a message is local or non-local.

Every message of a user is modeled with the seven feature models above, and the models are then trained and evaluated.

The functional characteristics of model training are as follows:

The input to model training is the series of messages (the message stream) crawled by the data acquisition unit. For each message, the seven features above are extracted, for example the application that sent the message and the links it contains.

Each feature model is represented as a set M. Each element of M is a tuple <fv, c>, where fv is a feature value (for example, English for the language model, or example.com for the link model) and c is the number of messages in which fv occurs. In addition, each model stores the total number N of training messages.

The models fall into two classes:

(1) A mandatory model is one in which every message has exactly one feature value, and that value is always present. The mandatory models are message sending time, message source, geographic proximity, and message language.

(2) An optional model is one for which a message need not have a value. Unlike a mandatory model, an optional model may also have several values for a single message. The optional models are links, interaction between users, and topic. For example, a message may contain zero, one, or several links. For each optional model, a feature value fv = null is reserved, and its count c records the number of messages without a value (for example, the number of messages that contain no link).
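The training step for a single feature model can be sketched as below. This is a minimal sketch, not the described implementation: the function name `train_model` and the input layout (one list of feature values per message) are assumptions, and Python's `None` stands in for the reserved fv = null entry.

```python
from collections import Counter


def train_model(values_per_message, optional=False):
    """Build (M, N) for one feature model.

    values_per_message: one list of feature values per training message.
    Returns a Counter mapping fv -> c (number of messages containing fv)
    and the total number N of training messages.
    """
    counts = Counter()
    total = 0
    for values in values_per_message:
        total += 1
        if values:
            # Count each distinct value once per message.
            counts.update(set(values))
        elif optional:
            # Reserved null entry: messages without a value for this feature.
            counts[None] += 1
        else:
            raise ValueError("a mandatory model needs a value for every message")
    return counts, total


if __name__ == "__main__":
    languages = [["English"]] * 12 + [["Chinese"]] * 9
    print(train_model(languages))
    links = [["example.com"], [], [], ["example.com", "a.org"]]
    print(train_model(links, optional=True))
```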

Training of the message sending time model differs slightly. Following the description above, the system first extracts the hour at which each message was sent. It then stores, for each hour fv, the number of messages posted during that hour. This raises a problem: the hours are discrete rather than continuous, so a message sent close to the user's normal hours could wrongly be judged abnormal.

To avoid this problem, an adjustment step is performed after the time model has been trained. For each hour i, the two adjacent hours are also considered. That is, for each tuple <i, c_i> in M, a new value c'_i is computed as the average of c_{i-1}, c_i, and c_{i+1}, where c_i is the number of messages posted in hour i, c_{i-1} the number posted in the preceding hour, and c_{i+1} the number posted in the following hour. Once c'_i has been computed, it replaces c_i in the tuple <i, c_i>.
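The adjustment step for the time model can be sketched as follows. The sketch assumes that the average is the plain mean of the three hourly counts and that hours wrap around midnight (hour 23 is adjacent to hour 0); neither detail is stated in the text, and the function name `smooth_hour_counts` is hypothetical.

```python
def smooth_hour_counts(counts):
    """Replace each hourly count by the mean of itself and its two neighbours.

    counts: list of 24 numbers, counts[i] = messages posted in hour i.
    Assumption: hours wrap around, so hour 23 and hour 0 are neighbours.
    """
    if len(counts) != 24:
        raise ValueError("expected one count per hour of the day")
    return [
        (counts[(i - 1) % 24] + counts[i] + counts[(i + 1) % 24]) / 3
        for i in range(24)
    ]


if __name__ == "__main__":
    raw = [0] * 24
    raw[9] = 3  # all activity at 09:00
    print(smooth_hour_counts(raw)[8:11])
```

With this adjustment, a message posted at 08:00 or 10:00 no longer falls on an hour with a count of zero when all earlier activity was at 09:00.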

The functional characteristics of model evaluation are as follows:

Model evaluation computes an anomaly score for each of the seven behavior feature models and finally combines the seven values into a single value, the anomaly score of the message.

Calculation of the anomaly scores of the seven feature models:

In general, a message is abnormal if its feature value for a mandatory model does not appear in the user's message stream, or if the number of times the value occurs is inconsistent with the tuples in M.

For a mandatory model, the anomaly score of a message is calculated as follows:

1. First, the value fv of the feature under analysis is extracted from the message. If M contains a tuple with fv as its first element, that tuple is retrieved from M. If M contains no tuple with fv as its first element, the message is abnormal and the procedure returns an anomaly score of 1.

2. Second, the user's behavior profile is used to decide whether fv is abnormal. The count c is compared with \bar{M}, defined as:

\bar{M} = \frac{\sum_{i=1}^{\|M\|} c_i}{N}

where c_i is the second element c of each tuple <fv, c> in M. If c is greater than or equal to \bar{M},

the message is considered consistent with the user's behavior profile and an anomaly score of 0 is returned. The reason is that this feature value has appeared in many of the user's past messages, so it matches the user's habits and is normal behavior.

If c is less than \bar{M},

the message is considered abnormal. The system computes the relative frequency f of fv according to the formula

f = \frac{c_{fv}}{N}

and returns the anomaly score 1 - f.

For an optional model, the anomaly score of a message is calculated as follows:

First, the value fv of the feature under analysis is extracted from the message. If M contains a tuple with fv as its first element, the message is judged consistent with the user's behavior profile and the system returns an anomaly score of 0.

If M contains no tuple with fv as its first element, the message is judged abnormal. The anomaly score in this case is defined as the probability p that this user has a null value for this model. Intuitively, if a user almost never uses a feature on the social network but a message nevertheless contains a value fv for that feature, the message is highly anomalous. The probability p is calculated as:

p = \frac{c_{null}}{N}

If M contains no tuple with null as its first element, c_{null} is 0. The value p is the anomaly score.
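The optional-model rule can be sketched in a few lines. This is an illustration only: the function name `optional_score` is hypothetical, the model M is assumed to be a dictionary from feature value to count, and `None` stands in for the null entry.

```python
def optional_score(model, n_messages, fv):
    """Anomaly score of feature value fv under an optional model.

    model: dict mapping fv -> c, with None as the reserved null entry.
    n_messages: total number N of training messages.
    """
    if n_messages <= 0:
        raise ValueError("the model must be trained on at least one message")
    if fv in model:
        return 0.0  # value seen before: consistent with the profile
    # Unseen value: score is p = c_null / N (c_null is 0 if there is no null entry).
    return model.get(None, 0) / n_messages


if __name__ == "__main__":
    link_model = {None: 18, "example.com": 2}  # 18 of 20 messages had no link
    print(optional_score(link_model, 20, "example.com"))
    print(optional_score(link_model, 20, "unknown.test"))
```

A user who rarely posts links has a large c_null, so a link to a new domain scores close to 1; a user who links in every message has c_null = 0, so a new domain scores 0 under this rule.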

As an example, consider detection with the language model: a particular user has posted 21 messages, 12 in English and 9 in Chinese. The set M of this user's language model is therefore (<English, 12>, <Chinese, 9>).

The next message this user posts falls into one of three cases:

The new message is in English. The tuple <English, 12> is first retrieved from M, and \bar{M} is computed by formula (4-1), the definition of \bar{M} given above. Then c = 12 is compared with \bar{M}; because c is greater than \bar{M}, the message is normal and an anomaly score of 0 is returned.

The new message is in French. Because the user has never posted in French before, the message is suspicious and an anomaly score of 1 is returned.

The new message is in Chinese. The tuple <Chinese, 9> is first retrieved from M, and \bar{M} is computed by formula (4-1). Then c = 9 is compared with \bar{M}; because c is less than \bar{M}, the message is abnormal. The relative frequency of Chinese for this user is f = 9/21 ≈ 0.43, so the anomaly score returned is 1 - f ≈ 0.57, which indicates that the message is abnormal. This value alone, however, is not enough to show that the message is malicious, because this may be a single abnormal message rather than one of a large number of such messages.
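The mandatory-model rule and the English/Chinese/French example above can be sketched as follows. This is a minimal illustration, not the described implementation. The function name `mandatory_score` is hypothetical, and the sketch assumes that the average \bar{M} is taken over the distinct feature values in M (the mean of 12 and 9 is 10.5), which is the reading under which the three outcomes of the example hold.

```python
def mandatory_score(model, n_messages, fv):
    """Anomaly score of feature value fv under a mandatory model.

    model: dict mapping fv -> c; n_messages: total number N of training messages.
    """
    if fv not in model:
        return 1.0  # never seen before
    c = model[fv]
    # Assumption: M-bar is the mean count per distinct feature value.
    mean_count = sum(model.values()) / len(model)
    if c >= mean_count:
        return 0.0  # frequent value: consistent with the profile
    # Rare value: score is 1 - f, where f = c_fv / N.
    return 1.0 - c / n_messages


if __name__ == "__main__":
    language_model = {"English": 12, "Chinese": 9}
    for language in ("English", "French", "Chinese"):
        print(language, round(mandatory_score(language_model, 21, language), 2))
```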

Calculation of the final anomaly score

The anomaly scores computed above for the individual models must then be combined into a single anomaly score for the message. This score is obtained as a weighted sum. The optimal weight of each feature model is learned with the SMO algorithm (Sequential Minimal Optimization, proposed in 1998 by John C. Platt of Microsoft Research and regarded as a very fast quadratic programming optimization algorithm) from a training set of users and user messages labeled as normal or abnormal. Different social networks naturally require different weights for the seven feature models. If the anomaly score of a message exceeds a threshold (described below), the message violates the user's behavior profile and is abnormal.
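The weighted sum itself is simple to sketch. The weights below are placeholders chosen for illustration; in the method they are learned with SMO from labeled data, which this sketch does not attempt. The function name `combined_score` and the feature names are hypothetical.

```python
def combined_score(scores, weights):
    """Weighted sum of per-feature anomaly scores.

    scores, weights: dicts keyed by feature-model name; the key sets must match.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same feature models")
    return sum(weights[name] * scores[name] for name in scores)


if __name__ == "__main__":
    # Placeholder weights, not learned values.
    weights = {"time": 0.2, "language": 0.4}
    scores = {"time": 1.0, "language": 0.5}
    print(combined_score(scores, weights))
```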

The functional characteristics of the similarity analysis of user messages are as follows:

After the seven features have been modeled on the user's behavior data and the anomaly score of every message has been calculated, all of the user's messages must also be grouped and analyzed for similarity, because messages such as phishing and fraud need to be spread in large numbers. When only a single message is judged abnormal, the corresponding account is therefore not considered abnormal; other similar messages must be observed as well. Only when the number of similar messages reaches a certain level are the accounts that sent them deemed abnormal.

Content similarity is calculated in two ways: (1) text content similarity; (2) similarity of the URLs contained.

Text content similarity: messages with similar content are considered related and are placed in the same group. This is implemented with an n-gram algorithm, with n = 4 in this method: two messages are similar if they share at least one word 4-gram. The larger n is, the more accurate the similarity calculation. In practice n = 3 or n = 4 is generally used, because a larger n makes computation and storage so expensive that the approach becomes impractical.
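The shared 4-gram test can be sketched as follows. The sketch is an illustration under stated assumptions: words are obtained by splitting on whitespace, which suits space-delimited text only (Chinese text would first need a word segmenter, not shown here), and the function names `word_ngrams` and `similar_by_text` are hypothetical.

```python
def word_ngrams(text, n=4):
    """Return the set of word n-grams of a message (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similar_by_text(a, b, n=4):
    """Two messages are similar if they share at least one word n-gram."""
    return bool(word_ngrams(a, n) & word_ngrams(b, n))


if __name__ == "__main__":
    m1 = "click here to win a free phone today"
    m2 = "hurry and click here to win big"
    m3 = "the weather is nice today in beijing"
    print(similar_by_text(m1, m2), similar_by_text(m1, m3))
```

Messages with fewer than n words produce no n-grams and are therefore never similar to anything under this rule.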

URL similarity: two messages are similar if they contain at least one identical URL. On social networking sites, the URLs in many spam messages typically contain a query string, so URLs are matched after the parameter part has been removed. Many social networking sites also shorten the URLs in users' messages with their own shortening service, so different short URLs can point to the same page. When a URL is a short URL, it must therefore be expanded to the final page URL before matching.
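Matching URLs with the parameter part removed can be sketched as below. This is a minimal sketch: the names `strip_query` and `similar_by_url` and the URL pattern are hypothetical, fragments are dropped along with the query string as an added assumption, and short-URL expansion is left out because it requires network access.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical pattern: an http(s) URL runs until the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


def similar_by_url(a, b):
    """Two messages are similar if they share at least one URL after stripping."""
    urls_a = {strip_query(u) for u in URL_PATTERN.findall(a)}
    urls_b = {strip_query(u) for u in URL_PATTERN.findall(b)}
    return bool(urls_a & urls_b)


if __name__ == "__main__":
    m1 = "win now http://spam.example.com/offer?id=111"
    m2 = "look at this http://spam.example.com/offer?id=222"
    print(similar_by_url(m1, m2))
```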

The functional characteristics of abnormality detection are as follows:

Detection combines two aspects: the anomaly score of each message, and the grouping of user messages from the similarity analysis, which ensures that the messages occur in large numbers. The detection rule is as follows: if the anomaly scores of messages in a group, measured against the user behavior models, exceed a certain threshold, the group is judged to be an abnormal message group, and the accounts corresponding to all messages in it are abnormal accounts. The threshold is calculated as:

th(n) = max(0.1, kn + d)

where n is the size of the group. Experiments show that the results are most accurate with k = -0.005 and d = 0.82.
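The threshold formula can be evaluated directly. The sketch below uses the reported values k = -0.005 and d = 0.82 as defaults; the function name `threshold` is hypothetical, and n is passed in as defined in the text.

```python
def threshold(n, k=-0.005, d=0.82):
    """th(n) = max(0.1, k*n + d), with the experimentally chosen k and d as defaults."""
    return max(0.1, k * n + d)


if __name__ == "__main__":
    for n in (10, 100, 200):
        print(n, round(threshold(n), 3))
```

Because k is negative, the threshold falls as n grows and is clamped at 0.1, so it never drops below that floor.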

The functional characteristics of the alarm unit are as follows:

The unit provides two services, alarming and alarm query, through an SMS sending subunit and an alarm query subunit. The alarm query function requires a database; MySQL is used.

The SMS sending subunit supports three sending modes, Curl, Thrift, and Json, and uses Java multithreading to handle concurrency. To avoid security problems when sending SMS messages, security authentication is required before an SMS is sent.

Authentication is performed through an HTTP header.

The parameters are as follows:

1. username: the user name;

2. Timestamp: the user's current timestamp, in the format yyyy-MM-dd hh:ss:mm;

3. Nonce: a random string of letters and digits that prevents replay and must be different in every call;

4. password: the key uniquely associated with the user name;

5. Signature: the signature, formed by concatenating username + Timestamp + Nonce + password into one string and processing it with a digest algorithm, for example MD5.

The user passes Username + \t + Timestamp + \t + Nonce + \t + md5{Signature} as a single string in the HTTP header "Authentication".
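The header construction and its check can be sketched as follows. This is an illustration under several assumptions that the text does not settle: the fields are joined with a tab character, the digest is the hexadecimal MD5 of the UTF-8 encoded concatenation, and replayed nonces are tracked in an in-memory set. The names `build_auth_header`, `verify_auth_header`, and `lookup_password` are hypothetical. MD5 is used only because the text names it as an example; it is not a strong choice for new designs.

```python
import hashlib
import hmac


def _signature(username, timestamp, nonce, password):
    """MD5 digest (hex) of username + Timestamp + Nonce + password."""
    raw = (username + timestamp + nonce + password).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


def build_auth_header(username, password, timestamp, nonce):
    """Return the value for the 'Authentication' header (tab-separated, assumed)."""
    return "\t".join([username, timestamp, nonce,
                      _signature(username, timestamp, nonce, password)])


def verify_auth_header(value, lookup_password, seen_nonces):
    """Check the signature and reject reused nonces.

    lookup_password: callable returning the stored key for a user name, or None.
    seen_nonces: mutable set of nonces already accepted.
    """
    parts = value.split("\t")
    if len(parts) != 4:
        return False
    username, timestamp, nonce, signature = parts
    password = lookup_password(username)
    if password is None or nonce in seen_nonces:
        return False
    expected = _signature(username, timestamp, nonce, password)
    if not hmac.compare_digest(signature, expected):
        return False
    seen_nonces.add(nonce)
    return True


if __name__ == "__main__":
    users = {"alice": "s3cret"}
    header = build_auth_header("alice", "s3cret", "2014-03-19 10:00:00", "abc123")
    nonces = set()
    print(verify_auth_header(header, users.get, nonces))  # first call
    print(verify_auth_header(header, users.get, nonces))  # replayed header
```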

The alarm query subunit requires database tables to be designed. It lets users of the social network abnormal event detection tool query alarm messages by sending time, recipient, and other criteria. Users can run a conditional query by entering the abnormal message type, receiving mobile number, and SMS sending time in the search box, or run a query for all records. The page lists, for all alarm messages, the abnormal user ID, abnormal message type, time the abnormal message was posted, SMS status, SMS sending time, receiving mobile number, and so on.

Brief description of the drawings

Fig. 1 is a model diagram of the method for analyzing abnormal behavior of social network users according to the present invention.

Fig. 2 is a flowchart of the operations of the data acquisition unit.

Fig. 3 is a flowchart of the operations of the analysis and detection unit.

Fig. 4 is a flowchart of the operations of the abnormality alarm unit.

Embodiment

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, the overall functional structure of the method comprises three units: the data acquisition unit, the analysis and detection unit, and the abnormality alarm unit, wherein:

The data acquisition unit extracts each user's behavior information from the large and complex mass of social network data. It first obtains identity authorization from the target social networking site, then uses web crawler technology to obtain, starting from the login node, the subset of all users that it has permission to view. For this subset, timeline data can be used to crawl the data of every user. The crawl results are then analyzed by userId to obtain all news feed items of the corresponding user, so that all statuses, shares, blog entries, links, and other data posted by that userId can be extracted. These data undergo HTML text parsing and language parsing and are then output as files named after the userId. The file content includes the data ID, publication time, data type, content, language, whether a link is included, the link address, and so on.

The analysis and detection unit builds user behavior models from the user data obtained by the data acquisition unit and trains and evaluates them. It then groups each user's behavior data by content similarity and finally performs abnormality detection according to the detection algorithm.

The alarm unit raises an alarm when an abnormal user is detected and provides SMS sending and alarm query functions.

Referring to Fig. 2, the concrete operation steps of the data acquisition unit are as follows:

1. Log in with a username and password registered on the target social networking site;

2. Starting from the logged-in user node, obtain all relevant URLs of each user to be crawled. If the user to be crawled is a friend of the logged-in user, the data can be crawled directly; if not, some information can only be viewed after the user has been added as a friend. The subset of all users to be crawled is obtained in this way;

3. At a set timer frequency, crawl users and their corresponding URLs incrementally, returning to step 2;

4. Start multiple threads to crawl the timeline data of all users in the user subset. First determine whether the logged-in account has access rights to the user: if not, the crawling thread for that user's data ends; if so, proceed to the subsequent steps;

5. Parse the pages at the URLs that contain user data, thereby obtaining information such as message type, message content, publication time, and publication source;

6. Repeat steps 3 to 5 until all timeline message data of all users have been crawled;

7. Output each user's message data as a file in a unified data format (data ID, data type, content, content language, publication time, etc.).
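The unified output of step 7 could look like the following minimal sketch. The field names, the JSON-lines encoding, and the `write_user_file` helper are illustrative assumptions; the source only lists the fields and states that each file is named after the userId.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import List, Optional


@dataclass
class MessageRecord:
    """One crawled timeline message (field names are hypothetical)."""
    data_id: str
    data_type: str                  # e.g. "status", "blog", "link"
    content: str
    language: str
    publish_time: str               # ISO 8601 string, an assumed encoding
    link_url: Optional[str] = None  # None when the message has no link


def write_user_file(out_dir: Path, user_id: str,
                    records: List[MessageRecord]) -> Path:
    """Write all records of one user to a file named after the userId."""
    path = out_dir / f"{user_id}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            row = asdict(rec)
            row["has_link"] = rec.link_url is not None
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


def demo() -> list:
    """Write two sample records to a temporary directory and read them back."""
    records = [
        MessageRecord("m1", "status", "hello", "en", "2014-03-19T08:00:00"),
        MessageRecord("m2", "link", "look at this", "en",
                      "2014-03-19T09:30:00", "http://example.com/a"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = write_user_file(Path(tmp), "user42", records)
        lines = path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines]
```

One record per line keeps the file easy to append to during incremental crawling, which is why JSON lines was chosen for this sketch.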

Referring to Fig. 3, the concrete operation steps of the analysis and detection unit are as follows:

1. Take the output of the data acquisition unit as the input data of this unit;

2. Group all message data of each user by the time period in which they were published;

3. Model every message in each group according to the 7 feature models, calculate the anomaly score for each feature model, and combine these scores to obtain the final anomaly score of each message;

4. Perform content-based similarity analysis on the messages in each group;

5. If the anomaly score of a message obtained in step 3 exceeds the threshold and its similarity in step 4 is also high, the user is an abnormal user.
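The grouping in step 2 can be illustrated with a short sketch. The fixed window length, the `(publish_time, text)` tuple layout, and the function name `group_by_period` are assumptions made for illustration; the source does not specify how the time periods are chosen.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def group_by_period(messages, period=timedelta(hours=24)):
    """Split (publish_time, text) pairs into consecutive windows of `period`.

    Windows are counted from the earliest message; the returned dict maps
    the window index to the texts published in that window.
    """
    if not messages:
        return {}
    start = min(ts for ts, _ in messages)
    groups = defaultdict(list)
    for ts, text in messages:
        # timedelta // timedelta yields an integer window index
        groups[(ts - start) // period].append(text)
    return dict(groups)
```

Each resulting group would then be passed to the scoring of step 3 and the similarity analysis of step 4.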

Referring to Fig. 4, the concrete operation steps of the abnormality alarm unit are as follows:

1. When the analysis and detection unit detects an abnormal user, an SMS sending event is triggered;

2. Because sending an SMS incurs a fee, a security check is first performed by means of the HTTP header before the SMS is sent;

3. Verify whether the username and password in the HTTP header match and belong to a valid user, and at the same time verify by signature verification whether the user has permission to send SMS;

4. When the security check succeeds, the SMS is sent to the receiving user and the sender is informed asynchronously; otherwise the sender is informed that it lacks permission, that the password is incorrect, or that the user does not exist;

5. SMS messages are sent in a multithreaded, asynchronous manner.
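The check-then-send flow of steps 2 to 5 might be sketched as follows. Everything concrete here is an assumption: the header names, the in-memory `USERS` table, the use of HMAC-SHA256 as the signature scheme, and the `send_sms` stub that stands in for a real SMS gateway. The source only states that credentials and a signature are verified before an asynchronous, multithreaded send.

```python
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

# Hypothetical store: username -> (password, signing secret, may send SMS)
USERS = {
    "alice": ("s3cret", b"alice-key", True),
    "bob": ("hunter2", b"bob-key", False),
}


def sign(secret: bytes, body: str) -> str:
    """Assumed signature scheme: hex HMAC-SHA256 of the message body."""
    return hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(headers: dict, body: str) -> str:
    """Return "ok" or a rejection reason based on the request headers."""
    user = USERS.get(headers.get("username", ""))
    if user is None or not hmac.compare_digest(user[0], headers.get("password", "")):
        return "wrong password or unknown user"
    _, secret, may_send = user
    valid_sig = hmac.compare_digest(sign(secret, body), headers.get("signature", ""))
    if not (may_send and valid_sig):
        return "no permission"
    return "ok"


def send_sms(receiver: str, body: str) -> str:
    """Stand-in for a real SMS gateway call; nothing is actually sent."""
    return f"sent to {receiver}"


def handle_alarm(pool, headers, receiver, body):
    """Verify first; on success submit the send to the pool (asynchronous)."""
    status = verify(headers, body)
    if status != "ok":
        return status, None
    return status, pool.submit(send_sms, receiver, body)
```

The caller receives a future immediately and can report the outcome to the sender once `future.result()` is available, which matches the asynchronous notification in step 4.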

Claims

1. A method for analyzing abnormal behavior of social network users, capable of detecting abnormal events on a target social networking site (Renren, microblogs, etc.), including sending advertisements from stolen accounts, publishing malicious links, network "flooding", and defrauding social friends of money, characterized in that: user behavior data are obtained based on web crawler technology; these data serve as the basis of user behavior analysis; the messages published by a user are modeled and trained to extract the user's behavior profile; whether a new message is abnormal is assessed against the user's behavior profile; and an alarm is sent when an abnormal event is detected.

The method mainly consists of three functional units, namely data acquisition, analysis and detection, and abnormality alarm, wherein:

Data acquisition is intended to obtain users' Deep Web data in the social network, that is, the statuses, blog posts, links, and other data that users publish and share. Obtaining these data requires a web crawler to perform Deep Web crawling of the social network: using a valid user account registered on the target social networking site, the crawler logs in to the target site to obtain site authorization and then crawls the users' Deep Web data.

Analysis and detection builds a user behavior model from the user data obtained by the data acquisition unit, trains and evaluates the model, then performs content-based similarity classification on each user's behavior data, and finally performs anomaly detection according to a specific algorithm.

Abnormality alarm raises an alarm when an abnormal user is detected, and provides SMS sending and alarm query functions.

2. The data acquisition functional unit according to claim 1, characterized in that: it obtains the basis of analysis of the method, namely social network user data. It must first obtain authentication and authorization from the target social networking site. It then uses web crawler technology, starting from the logged-in node, to obtain the subset of all users whose data the logged-in account is permitted to view, and crawls the timeline data of every user in that subset. The crawled result set is analyzed by userId, the user's unique ID number, to obtain all news feed items of the corresponding user, from which all statuses, blog posts, links, and other data that the user has published or shared are extracted. These data then undergo HTML text parsing and language parsing, and the parsed result is output as a file named after the userId. The file content includes the data ID, publication time, data type, content, language type, whether a link is included, the link address, and so on.

3. The user behavior modeling method in the analysis and detection unit according to claim 1, characterized in that: a user's behavior profile is built from the message stream that the user publishes on the social networking site, and this message stream is exactly the output of the data acquisition unit.

In view of the characteristics of social networks and the needs of detection, this unit defines 7 features for every message and trains one statistical model per feature. Each model captures one aspect of a message. After all messages of a user have been analyzed, the user's feature values in these 7 aspects are obtained, so that the content of messages the user is likely to send can be anticipated.

4. The 7 features according to claim 3, characterized in that: the 7 features correspond to 7 feature models of each message, namely the time at which the message was sent (hour of day), the application used to publish the message, the language, the topic, links, interaction between users, and geographic location, and these 7 features are divided into two classes:

(1) A mandatory model has a feature value for every message, and this feature value is always present. The mandatory (default) features are the time at which the message was sent, the message source, geographic proximity, and the message language.

(2) An optional model is one whose feature need not have a value for a given message. Unlike a mandatory model, the feature may also take multiple values for one message. The optional models are links, interaction between users, and topic. For example, a message may contain zero, one, or several links. For each optional model, a feature value fv = null is reserved, and its count c records the number of messages in which the feature is absent (for example, the number of messages without a link). Here fv denotes a feature value and c denotes the number of messages in which fv occurs.

5. The training and evaluation of the user behavior model in the analysis and detection unit according to claim 1, characterized in that:

Training of the model:

The input is the series of messages (message stream) crawled by the data acquisition unit. For each message, the 7 features above are extracted, for example the source application that sent the message and the links contained in the message. Each feature model is represented by a set M. Each element of M is a tuple <fv, c>, where fv is a feature value (for example, English for the language model, or example.com for the link model) and c is the number of messages in which fv occurs. In addition, each model stores the total number N of training messages.

Training of the feature model for message sending time is slightly different. Specifically, for each hour i, the two adjacent hours are also considered. For each tuple <i, C_i> of M, a new value C'_i is computed as the average of the number of messages C_i sent in hour i, the number C_{i-1} sent in the preceding hour, and the number C_{i+1} sent in the following hour. Once C'_i has been computed, it replaces C_i in the tuple <i, C_i>.
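The training procedure can be sketched in a few lines. This is an illustration under assumptions, not the claimed implementation: the function names are hypothetical, a feature model M is held as a `Counter` mapping fv to c, `None` plays the role of fv = null, and the hour smoothing wraps around midnight (hour 23 is treated as adjacent to hour 0), which the source does not state.

```python
from collections import Counter


def train_mandatory(values):
    """Mandatory model: one feature value per message. Returns (M, N)."""
    return Counter(values), len(values)


def train_optional(values_per_message):
    """Optional model: each message carries zero or more values.

    c counts messages in which fv occurs, so repeated values inside one
    message are counted once; messages without a value are counted under None.
    """
    model = Counter()
    for values in values_per_message:
        if values:
            model.update(set(values))
        else:
            model[None] += 1
    return model, len(values_per_message)


def smooth_hours(hour_counts):
    """Replace each hourly count by the mean of itself and its two neighbours."""
    return {
        h: (hour_counts.get((h - 1) % 24, 0)
            + hour_counts.get(h, 0)
            + hour_counts.get((h + 1) % 24, 0)) / 3
        for h in range(24)
    }
```

For instance, a user who posted twice at hour 8 and once at hour 9 gets a smoothed count of 1.0 for both hours, and a small non-zero count for hours 7 and 10, so a message sent slightly earlier or later than usual is not treated as entirely unseen.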

Evaluation of the model:

The anomaly score of a message is computed to determine whether the message deviates from the user's behavior profile.

For a mandatory feature model, the anomaly score of a message is computed as follows:

(1) First, the feature value fv for the model under analysis is extracted from the message. If M contains a tuple whose first element is fv, that tuple is retrieved from M. If M contains no tuple with fv as its first element, the message is abnormal, and the procedure returns an anomaly score of 1.

(2) Whether fv is abnormal is then assessed against the user's behavior profile. The count c is compared with \bar{M}, which is defined by the formula:

\bar{M} = \frac{\sum_{i = 1}^{\| M \|} c_{i}}{N}

If c is less than \bar{M}, the message is considered abnormal. The relative frequency f of fv is then calculated according to the formula

f = \frac{c_{fv}}{N}

and the system returns the anomaly score (1 - f).
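A minimal sketch of the mandatory-model scoring follows, with \bar{M} computed exactly as in the formula above. The function name, the dict representation of M, and the sample counts are hypothetical. The score returned when c is not less than \bar{M} is not given in the text; 0 is used here purely as an assumption. Exact fractions are used in the example so that the comparison with \bar{M} is not affected by floating-point rounding.

```python
from fractions import Fraction


def mandatory_score(model, n_total, fv):
    """Anomaly score of feature value fv against a model M = {fv: c}."""
    if fv not in model:
        return 1                            # step (1): fv never seen in training
    c = model[fv]
    m_bar = sum(model.values()) / n_total   # M-bar as defined in the formula
    if c < m_bar:
        return 1 - c / n_total              # step (2): 1 - f, with f = c_fv / N
    return 0                                # assumption: case not stated in the text


# Hypothetical smoothed hour-of-day counts for a user with N = 3 messages
hour_model = {
    7: Fraction(2, 3),
    8: Fraction(1),
    9: Fraction(1),
    10: Fraction(1, 3),
}
```

With these counts, a message sent at hour 10 has c = 1/3, which is below \bar{M} = 1, so its score is 1 - (1/3)/3 = 8/9; a message sent at hour 15 has no tuple in M and scores 1.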

For an optional feature model, the anomaly score of a message is computed as follows:

p = \frac{c_{null}}{N}

6. The content-based similarity classification in the analysis and detection unit according to claim 1, characterized in that: account anomaly detection requires content-based similarity analysis because phishing, fraud, and similar messages need to be spread in large numbers. Therefore, when only a single message is judged abnormal, its account is not yet considered abnormal; further similar messages must be observed, and only when the number of similar messages reaches a certain level are the accounts that sent them judged to be abnormal accounts.

Content similarity is computed in two ways: first, similarity of text content; second, similarity of the URLs contained.
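The two kinds of similarity could be realized in many ways; the text does not name a measure. The sketch below substitutes a common choice: Jaccard similarity over word 3-grams for text, and a shared-URL test for links. The function names, the URL regular expression, and the n-gram length are all assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def shingles(text, n=3):
    """Set of word n-grams of the text, with URLs removed and case folded."""
    words = URL_RE.sub(" ", text.lower()).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def text_similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets (0.0 if either set is empty)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def url_similar(a, b):
    """True if the two messages contain at least one identical URL."""
    return bool(set(URL_RE.findall(a)) & set(URL_RE.findall(b)))
```

Messages whose text similarity exceeds a chosen cut-off, or which share a URL, would be placed in the same similarity class; the cut-off itself would have to be tuned on real data.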

7. The anomaly detection in the analysis and detection unit according to claim 1, characterized in that: two classes of anomaly are mainly detected: first, groups of suspicious users whose accounts have been compromised; second, suspicious users or applications that have not been compromised. They differ in that the former have a normal user behavior profile and only later publish a large number of similar messages, whereas the latter publish a large number of similar messages from beginning to end.

The data acquisition unit obtains user data for a certain time interval, so the messages that the analysis and detection unit classifies by content also fall within a certain time interval. The data in each such time interval are called a group. For each group, the method checks whether the messages of all user accounts violate the corresponding user behavior profiles. On the basis of this analysis, it can be determined whether an account is abnormal.

The rule for detecting abnormal accounts is: whenever the anomaly scores of messages in a group, measured against the user behavior models, exceed a certain threshold, the group is judged to be an abnormal message group, and the accounts corresponding to all its messages are abnormal accounts. The threshold is computed as:

th(n)=max(0.1,kn+d)

where n is the group size. Experiments show that the results are most accurate when k = -0.005 and d = 0.82. As the formula shows, the abnormal-message threshold is higher for small groups and lower for large groups.
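The threshold formula is easy to check numerically. In the sketch below, `threshold` implements th(n) with the stated k and d; `group_is_abnormal` is only one possible reading of the detection rule (a group is flagged if any message score exceeds th(n)) and its name and signature are hypothetical.

```python
def threshold(n, k=-0.005, d=0.82):
    """th(n) = max(0.1, k*n + d) for a group of n messages."""
    return max(0.1, k * n + d)


def group_is_abnormal(scores, k=-0.005, d=0.82):
    """Assumed reading of the rule: flag the group if any score exceeds th(n)."""
    th = threshold(len(scores), k, d)
    return any(score > th for score in scores)
```

With the stated parameters, th(10) is about 0.77 and th(100) is about 0.32, and the floor of 0.1 applies to groups of roughly 144 messages or more, so large groups of similar messages are flagged on much weaker evidence than small ones.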

8. The abnormality alarm unit according to claim 1, characterized in that: the alarm unit provides two services, alarming and alarm query, and three calling methods: curl, Thrift, and JSON. The alarm is delivered by sending an SMS.