CN109558555A - Microblog water army detection method and detection system based on artificial immunity danger theory - Google Patents

Info

Publication number
CN109558555A
Authority
CN
China
Prior art keywords
microblog
user
microblogging
danger
antigen
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810950560.1A
Other languages
Chinese (zh)
Other versions
CN109558555B (en
Inventor
杨超
张*
秦廷栋
项振辉
陈炳秋
何先先
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University
Original Assignee
Hubei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University
Priority to CN201810950560.1A
Publication of CN109558555A
Application granted
Publication of CN109558555B
Legal status: Active

Abstract

The invention belongs to the technical field of microblog networks and discloses a microblog water army detection method and detection system based on artificial immunity danger theory. The idea of artificial immunity is applied to the detection of microblog user behavioral features: microblog user data are obtained with a focused web crawler; network water army behavior is characterized and defined through analysis of user behavioral features, yielding the feature attributes that distinguish the new type of network water army from normal users; finally, the signal processing mechanism of artificial immunity danger theory is applied to water army detection, and the dendritic cell algorithm (DCA) of danger theory is used to detect water army users in the microblog. The invention obtains microblog user data with a Python-based focused web crawler and stores the data in structured form in a database. This approach makes the data set easier to obtain, can reasonably collect all kinds of user behavioral data, and offers a short crawling period and high data quality.

Description

Microblog water army detection method and detection system based on artificial immunity danger theory
Technical field
The invention belongs to the technical field of microblog networks, and in particular relates to a microblog water army detection method and detection system based on artificial immunity danger theory.
Background art
At present, the prior art commonly used in the field is as follows:
The microblog network water army (paid posters) is a general term for producers of junk information who, driven by profit, manipulate software robots or water army accounts to create and spread false opinions and junk information in microblogs, for purposes such as tampering with data authenticity, misleading public opinion, and harming citizens' interests. Data mining techniques are used in microblog water army detection: highly discriminative features or behavior patterns are defined to uncover hidden water army accounts.
The main water army detection methods at present are as follows:
Detection based on content features: this includes text classification, text sentiment analysis, and text orientation analysis. The water army is identified by computing the similarity between microblog content and junk information, or between comment content and spam comments.
Detection based on environmental features: network-level features of the water army are analyzed from TCP footprint information, IP blacklist information, robot website command tracking, routing information, and similar data obtained from the network environment, so that the water army can be tracked.
Detection based on user features: the relationship features and behavioral features of network users are analyzed, relevant feature attributes are selected to train a classifier, and the trained classifier is then used to detect the microblog network water army.
In summary, the problems with the prior art are:
Detection based on content features: as network environments have grown more complex and platforms have imposed real-name registration, the water army has shifted from accounts generated in batches by automated systems to a new type operated by real users. The junk information produced by the latter resembles the content of normal users and no longer carries clearly recognizable features, so this method cannot effectively find the new type of network water army.
Detection based on environmental features: network environment features such as TCP footprint information, IP blacklist information, and routing information cannot be disguised by modification, so recognition accuracy is high. However, network environment data sets are difficult to obtain, so the approach is hard to generalize.
Detection based on user features: this method finds hidden water army accounts well and is better suited to detection on social network platforms, but its feature description is incomplete, it processes large volumes of multi-indicator data inefficiently, and it needs a large training data set.
The difficulty and significance of solving the above technical problems:
(1) As the water army becomes more aware of concealing itself, detection based purely on content features misses the new type of network water army, which mostly spreads under the cover of normal text features, so its practicality is low. The invention mines behavior patterns specific to the microblog water army from the way microblog users register, post, repost, comment, and like, analyzes water army behavioral features in depth, and extracts the attributes that distinguish water army users from non-water-army users. These attributes play an important role in characterizing the microblog water army.
(2) Traditional detection based on environmental features faces great difficulty in data acquisition and is hard to generalize. The invention adopts a focused web crawler strategy: it obtains Sina Weibo login credentials through simulated login, formulates a URL search strategy, saves the HTML retrieved from the specified links, and finally parses the HTML into structured data stored in a database. This data acquisition strategy crawls efficiently, can be designed to fetch specific content from specified pages as required, and generalizes well, providing good data support for water army detection.
(3) Water army behavior is becoming increasingly complex. Detecting it from local, single-aspect features gives an incomplete feature description and easily causes identification errors. The invention starts from users' basic behavior (registration information, registration time, etc.), posting behavior (posting microblogs, etc.), following behavior (followees, fans, etc.), and reposting behavior (reposts, comments, likes), studies microblog user behavioral features more comprehensively and deeply, and applies the results to microblog water army detection. The invention describes microblog water army features more comprehensively, which reduces identification errors and supports a fuller choice of features for detection.
(4) Traditional classification-based detection using user features needs a large training data set, is inefficient, and has limited applicability. The invention applies the signal processing mechanism of artificial immunity danger theory to water army detection and uses the dendritic cell algorithm (DCA) of danger theory to detect water army users in the microblog. The DCA does not depend on a knowledge base, is computationally efficient, and can reduce the false positive and false negative rates. Building on these characteristics, the invention achieves water army detection with high computational efficiency, no training data set, and relatively high detection accuracy.
Summary of the invention
In view of the problems of the prior art, the invention provides a microblog water army detection method and detection system based on artificial immunity danger theory. The object of the invention is to introduce the idea of artificial immunity danger theory into the analysis of user behavioral features so as to identify microblog water army users effectively. By analyzing the behavioral features of the Sina Weibo water army, feature attributes such as total microblog count, microblog level, whether the account is verified, sunlight credit, and number of fans are selected. The analysis results for these attributes serve as feature signals that distinguish the water army from normal users, and Sina Weibo water army identification is realized with the dendritic cell algorithm (Dendritic Cell Algorithm, DCA).
In social network environments, the user anomalies and network security problems caused by various kinds of user behavior are highly similar to the intrusion detection problems to which artificial immune systems are applied. For example, the dendritic cell algorithm (DCA) of artificial immunity danger theory has been used to build an integrated intrusion detection (RSAI-IID) model and to perform bulk spam detection and web server anomaly detection. The dendritic cell algorithm is computationally efficient, can reduce the false positive and false negative rates, and needs no training data set.
The invention is realized as follows. A microblog water army detection method based on artificial immunity danger theory comprises:
obtaining microblog user behavioral data with a focused web crawler, and applying artificial immunity to the detection of microblog user behavioral features;
analyzing user behavioral features and defining network water army behavior, to distinguish the feature attributes of the new type of network water army from those of normal users;
detecting network water army user behavior in the microblog with the dendritic cell algorithm (DCA) of artificial immunity danger theory.
Further, the microblog water army detection method based on artificial immunity danger theory specifically includes:
Step 1, acquisition of microblog data: a focused web crawler crawls microblog user information. The experimental data of the invention are obtained in two ways, by calling the Sina Weibo API and by a focused web crawler written in Python, and are preprocessed by removing duplicates and empty values.
Step 2, selection of features: 14 user behavioral features are extracted from each user's microblog: number of fans, number of followees, total microblog count, original microblog count, whether verified, microblog level, whether a profile description is present, registration time, sunlight credit, number of mutual follows, number of topics joined, number of comments, number of reposts, and number of likes. Through repeated comparative experiments and summarization, these 14 original features are fused into 6 indicators: sunlight credit, activity, identity evaluation, influence, fan-to-followee ratio, and original microblog ratio.
Step 3, definition of antigen signals: the 6 indicators sunlight credit SC, activity AT, identity evaluation IE, influence CI, fan-to-followee ratio FF, and original microblog ratio OM are normalized by a piecewise mapping function, where x is the original signal value: when x ∈ [m, n] a linear mapping is applied, and when x ∈ [n, ∞) the signal takes its maximum value of 10.
Step 4, microblog water army detection based on the DCA: microblog users are treated as antigens. First, the number of antigens to sample and the dendritic cell population are initialized. An unidentified microblog user is selected at random from the detection sample, and the pathogen-associated molecular pattern signal, danger signal, safe signal, and inflammatory signal corresponding to that user are taken as input signals.
The concentrations of CSM, SEM, and MAT are calculated from the DCA calculation formula and its corresponding weight matrix, and the CSM, SEM, and MAT concentrations obtained by the DC cells presenting the same antigen are accumulated.
In the DCA calculation formula, (1+IS) is the amplification signal; the values of the input signals PAMP, DS, and SS are C_P, C_D, and C_S, with weights W_P, W_D, and W_S respectively; and the values of the output signals CSM, SEM, and MAT are C[CSM], C[SEM], and C[MAT] respectively.
The CSM, SEM, and MAT values are calculated from the input signal values and the weight matrix and accumulated. If CSM exceeds the migration threshold, SEM and MAT are compared, and the state of the DC and of the antigens it collected is labeled according to the comparison. Once the total number of judgments for an antigen reaches the antigen discrimination threshold, the mature context antigen value MCAV is calculated as MCAV=MAT/(SEM+MAT), where SEM and MAT are the values of the output signals SEM and MAT. MCAV is then compared with the anomaly threshold: if MCAV is larger, the antigen is labeled abnormal and the user is a water army account; otherwise the antigen is labeled normal.
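The decision logic at the end of step 4 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the tie-breaking rule (MAT must strictly exceed SEM for a cell to be labeled mature), and the handling of a zero denominator are assumptions, and the threshold values in the usage lines are made up.

```python
def dc_context(csm, sem, mat, migration_threshold):
    """Label one dendritic cell from its accumulated output signals.

    Returns None while CSM has not exceeded the migration threshold,
    otherwise "mature" (dangerous context) or "semi-mature" (safe context).
    Treating a tie as "semi-mature" is an assumption of this sketch.
    """
    if csm <= migration_threshold:
        return None
    return "mature" if mat > sem else "semi-mature"


def mcav(sem_total, mat_total):
    """MCAV = MAT / (SEM + MAT) on the accumulated output signal values."""
    denominator = sem_total + mat_total
    if denominator == 0:
        return 0.0  # assumption: no evidence either way counts as normal
    return mat_total / denominator


def classify(sem_total, mat_total, anomaly_threshold):
    """Compare MCAV with the anomaly threshold and label the antigen (user)."""
    if mcav(sem_total, mat_total) > anomaly_threshold:
        return "water army"
    return "normal"


# Illustrative values only.
label = classify(sem_total=2.0, mat_total=6.0, anomaly_threshold=0.5)
```

With the illustrative totals above, MCAV is 6/(2+6) = 0.75, which exceeds the assumed threshold of 0.5.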
Further, in step 1, the crawling method includes simulated login, obtaining user address links, and HTML code parsing:
(1) Simulated login: login is performed after authentication at the login address succeeds.
(2) Obtaining user address links: Sina Weibo divides users by verification type into ordinary users without Sina verification, individually verified users marked with a yellow or gold V, and verified enterprises and organizations marked with a blue V. The home pages and second-level pages of users with different verification types have different URL link templates.
(3) HTML code parsing: after the preceding login and target URL definition, the urllib and urllib2 libraries that ship with Python are used to parse the HTML of the URL, or Scrapy, an advanced Python crawler framework, is used to locate information in the HTML page. The information on the web page is then scraped.
Further, in step 2, the fusion method includes:
1) Sunlight credit SC is divided into five grades: very low (300-419), low (420-450), average (451-570), good (571-690), and excellent (691-900), represented in the fusion by the values 1-5 respectively.
2) Activity AT is computed from total microblog count M, number of topics joined T, registration time Z, and current time N, where N-Z is measured in days:
AT=(0.7M+0.3T)/(N-Z);
3) Identity evaluation IE is computed from whether a profile description is present I, whether the account is verified C, and level G, with weights 0.2, 0.4, and 0.4 respectively:
IE=0.2I+0.4C+0.4G;
4) Influence CI is computed from the numbers of comments J, reposts R, and likes F, that is, the numbers of times the user's microblogs have been commented on, reposted, and liked, with weights 0.3, 0.5, and 0.2 respectively:
CI=0.3J+0.5R+0.2F;
5) Fan-to-followee ratio FF is the ratio of each user's number of fans Fans to number of followees Followers:
FF=Fans/Followers;
6) Original microblog ratio OM is the proportion of original microblogs Weibo_Original in the total microblog count M of the user: OM=(Weibo_Original)/M.
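The six fusion rules above translate directly into code. The following Python sketch uses the grade bands and weights given in the text; the function names, the guards against division by zero, and the sample values are assumptions added for illustration.

```python
from datetime import date


def credit_grade(score):
    """Map a sunlight credit score (300-900) to a grade from 1 to 5."""
    if not 300 <= score <= 900:
        raise ValueError("sunlight credit score must lie in 300-900")
    for upper, grade in ((419, 1), (450, 2), (570, 3), (690, 4), (900, 5)):
        if score <= upper:
            return grade


def activity(m, t, registered, today):
    """AT = (0.7M + 0.3T) / (N - Z), with N - Z in days."""
    days = (today - registered).days
    return (0.7 * m + 0.3 * t) / max(days, 1)  # assumption: at least one day


def identity(has_intro, verified, level):
    """IE = 0.2I + 0.4C + 0.4G, with I and C given as 0 or 1."""
    return 0.2 * has_intro + 0.4 * verified + 0.4 * level


def influence(comments, reposts, likes):
    """CI = 0.3J + 0.5R + 0.2F."""
    return 0.3 * comments + 0.5 * reposts + 0.2 * likes


def fan_follow_ratio(fans, followees):
    """FF = Fans / Followers; an account following nobody maps to infinity
    (assumption: the text does not cover this case)."""
    return fans / followees if followees else float("inf")


def original_ratio(original, total):
    """OM = Weibo_Original / M; an account with no posts maps to 0.0
    (assumption: the text does not cover this case)."""
    return original / total if total else 0.0


# Made-up sample account.
at = activity(700, 100, registered=date(2018, 1, 1), today=date(2018, 4, 11))
```

The sample account registered 100 days before the reference date, so its activity is (0.7·700 + 0.3·100)/100, which is approximately 5.2.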
Further, in step 3,4 kinds of input signals of the index of correlation and DCA algorithm that detect Sina weibo waterborne troops map Include:
Pathogen associated molecular pattern PAMP: showing user behavior exception, and there are the features of waterborne troops's behavior, defines PAMP= {<SC,IE,FF>};
Danger signal DS: showing that a possibility that user behavior is abnormal, abnormal is higher, and only normally performed activity changes, but There are the possibility of waterborne troops's behavior, define DS={<AT, CI, OM>};
Safety signal SS: it indicates that a possibility that user is normal is higher, and is in normal condition, define SS={<SC, IE>};
Pro-inflammatory cytokine IS: showing active user generally there are exception, plays the role of amplifying PAMP, DS, SS signal, fixed Adopted IS={<CI>}.
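Step 3 can be sketched in Python as a normalization followed by grouping of the indicators into the four input signals. The text fixes only that a linear mapping applies on [m, n] and that values at or above n saturate at 10; mapping m to 0 and clamping values below m to 0 are assumptions of this sketch, as are the names `normalize` and `group_signals`. How the indicators inside one signal are combined into a single value is not stated in the text, so the sketch only groups them.

```python
def normalize(x, m, n, top=10.0):
    """Map an indicator onto [0, top].

    Linear on [m, n]; values at or above n saturate at `top`.
    Sending m to 0 and clamping below m are assumptions.
    """
    if x >= n:
        return top
    if x <= m:
        return 0.0
    return top * (x - m) / (n - m)


# Indicator groups per input signal, as defined in the text.
SIGNAL_INDICATORS = {
    "PAMP": ("SC", "IE", "FF"),
    "DS": ("AT", "CI", "OM"),
    "SS": ("SC", "IE"),
    "IS": ("CI",),
}


def group_signals(indicators):
    """Collect the normalized indicator values that feed each input signal.

    `indicators` maps the names SC, AT, IE, CI, FF, OM to normalized values.
    """
    return {
        signal: tuple(indicators[name] for name in names)
        for signal, names in SIGNAL_INDICATORS.items()
    }


# Made-up normalized values for one user.
user = {"SC": 2.0, "AT": 7.5, "IE": 1.0, "CI": 9.0, "FF": 0.5, "OM": 3.0}
signals = group_signals(user)
```

Keeping the groups in a single table makes it easy to see that SC and IE feed both PAMP and SS, and that CI feeds both DS and IS.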
Another object of the invention is to provide a computer program that runs the microblog water army detection method based on artificial immunity danger theory.
Another object of the invention is to provide a terminal carrying at least a controller that implements the microblog water army detection method based on artificial immunity danger theory.
Another object of the invention is to provide a computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to execute the microblog water army detection method based on artificial immunity danger theory.
Another object of the invention is to provide a microblog water army detection system based on artificial immunity danger theory that implements the above detection method, the system comprising:
A microblog data acquisition module, which crawls microblog user information with a focused web crawler;
A feature selection module, which extracts 14 user behavioral features from each user's microblog (number of fans, number of followees, total microblog count, original microblog count, whether verified, microblog level, whether a profile description is present, registration time, sunlight credit, number of mutual follows, number of topics joined, number of comments, number of reposts, and number of likes) and, through repeated comparative experiments and summarization, fuses these 14 original features into 6 indicators: sunlight credit, activity, identity evaluation, influence, fan-to-followee ratio, and original microblog ratio;
An antigen signal definition module, which normalizes the 6 indicators sunlight credit SC, activity AT, identity evaluation IE, influence CI, fan-to-followee ratio FF, and original microblog ratio OM;
A DCA-based microblog water army detection module, which treats microblog users as antigens. It first initializes the number of antigens to sample and the dendritic cell population, selects an unidentified microblog user at random from the detection sample, and takes the pathogen-associated molecular pattern signal, danger signal, safe signal, and inflammatory signal corresponding to that user as input signals. It calculates the concentrations of CSM, SEM, and MAT from the DCA calculation formula and its corresponding weight matrix, and accumulates the CSM, SEM, and MAT concentrations obtained by the DC cells presenting the same antigen. If CSM exceeds the migration threshold, it compares SEM and MAT and labels the state of the DC and of the antigens it collected according to the comparison. Once the total number of judgments for an antigen reaches the antigen discrimination threshold, it calculates the mature context antigen value MCAV and compares it with the anomaly threshold: if MCAV is larger, the antigen is labeled abnormal and the user is a water army account; otherwise the antigen is labeled normal.
Another object of the invention is to provide a microblog network platform carrying at least the microblog water army detection system based on artificial immunity danger theory.
In step 4, the DCA uses a signal weight matrix that gives the weight of each input signal for each output signal.
A larger weight in the weight matrix means a greater influence on the corresponding output signal, and a negative weight means a negative influence on that output. The weights convert input signal values into output signal values: (1+IS) is the amplification signal, and W_P, W_D, and W_S are the weights of the input signals when calculating the output signals (C[CSM], C[SEM], and C[MAT]), taken from the weight matrix. For example, when calculating C[CSM], W_P=8, W_D=4, and W_S=-6.
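The calculation formula itself is not reproduced in this text. As an illustration only, the following Python sketch assumes the weighted-sum form commonly published for the DCA, in which each output is the weighted sum of the three inputs divided by the sum of the absolute weights and scaled by (1+IS)/2. That form, the function name `fuse_signals`, and the input values are assumptions; only the CSM weights (8, 4, -6) come from the text.

```python
def fuse_signals(c_p, c_d, c_s, inflammation, weights):
    """Assumed DCA fusion for one output signal.

    C = (W_P*C_P + W_D*C_D + W_S*C_S) / (|W_P| + |W_D| + |W_S|) * (1 + IS) / 2
    """
    w_p, w_d, w_s = weights
    weighted = w_p * c_p + w_d * c_d + w_s * c_s
    norm = abs(w_p) + abs(w_d) + abs(w_s)
    return weighted / norm * (1.0 + inflammation) / 2.0


# CSM weights quoted in the text; the input signal values are made up.
CSM_WEIGHTS = (8.0, 4.0, -6.0)
c_csm = fuse_signals(c_p=6.0, c_d=4.0, c_s=2.0, inflammation=1.0,
                     weights=CSM_WEIGHTS)
```

Under these assumptions the weighted sum is 8·6 + 4·4 − 6·2 = 52, the normalizer is 18, and the amplification factor (1+1)/2 is 1, so C[CSM] is 52/18, about 2.89. The negative safe-signal weight lowers the output as SS grows, which matches the description of negative weights above.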
Further, an experimental evaluation is carried out after step 4. The above steps determine whether each microblog user in the experimental sample is a water army account. Based on the detection results and the ground-truth data, three measures are used to assess the accuracy of the method: accuracy rate (PR), recall rate (RR), and the harmonic mean F1. The higher the accuracy rate, recall rate, and harmonic mean, the better the detection. Each measure is calculated as follows.
Accuracy rate: PR is computed from class+=TP/(TP+FP) and class-=TN/(TN+FN), which denote the classifier's classification accuracy for the microblog water army and for normal users respectively. TP and TN are the numbers of water army and non-water-army samples that were detected correctly, and FN and FP are the numbers of actual water army and non-water-army samples that were misidentified. PR denotes the classifier's average accuracy, and its level is determined jointly by class+ and class-.
Recall rate: RR=TP/(TP+FN).
Harmonic mean: F1=(2*PR*RR)/(PR+RR).
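The three measures can be computed from the four confusion counts as in the Python sketch below. The formula for PR is not reproduced in this text; the sketch assumes PR is the arithmetic mean of class+ and class-, which is consistent with its description as an average accuracy determined by both values but is not confirmed. The counts in the usage line are made up.

```python
def evaluate(tp, tn, fp, fn):
    """Return (PR, RR, F1) from confusion counts.

    PR is taken as the mean of class+ and class- (an assumption).
    """
    class_pos = tp / (tp + fp)  # accuracy on samples labeled water army
    class_neg = tn / (tn + fn)  # accuracy on samples labeled normal
    pr = (class_pos + class_neg) / 2
    rr = tp / (tp + fn)
    f1 = 2 * pr * rr / (pr + rr)
    return pr, rr, f1


# Illustrative counts, not results from the patent.
pr, rr, f1 = evaluate(tp=40, tn=45, fp=10, fn=5)
```

For these counts class+ is 0.8 and class- is 0.9, so PR is 0.85 under the averaging assumption, and RR is 40/45.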
In summary, the advantages and positive effects of the invention are:
The invention applies the idea of artificial immunity to the detection of microblog user behavioral features. It obtains microblog user data easily and quickly with a focused web crawler, characterizes and defines network water army behavior through analysis of user behavioral features to obtain feature attributes that effectively distinguish the new type of network water army from normal users, and finally applies the signal processing mechanism of artificial immunity danger theory to water army detection, using the core algorithm of danger theory, the dendritic cell algorithm (DCA), to detect water army users in the microblog.
The invention draws on the idea of the biological immune system and proposes detecting microblog network water army users with the DCA of danger theory. By analyzing the behavioral features of water army users on Sina Weibo, it judges whether water army behavior is present from the differences between normal users and water army users in features such as reposts, comments, and sunlight credit. Compared with traditional water army identification and detection methods, the invention has the following advantages:
(1) The invention analyzes water army behavioral features in depth and defines the user features on that basis. The approach is dynamic and adaptive and, compared with traditional detection techniques based on content features, finds the new type of network water army that lacks clearly recognizable features more effectively.
(2) The invention obtains microblog user data with a Python-based focused web crawler and stores the data in structured form in a database. This approach makes the data set easier to obtain, can reasonably collect all kinds of user behavioral data, and offers a short crawling period and high data quality.
(3) The invention comprehensively analyzes 14 user behavioral features in the user's microblog, such as number of fans, number of followees, and total microblog count, fuses them into six indicators, and detects the microblog water army with the DCA of artificial immunity danger theory. Its feature analysis is more comprehensive, its data processing is efficient, and it does not need a large data set for training.
(4) The invention uses the data set obtained with the focused web crawler strategy and detects the microblog water army with the DCA, comparing three measures in the experimental results: accuracy rate, recall rate, and harmonic mean. The invention is compared with the content-feature-based and user-feature-based detection algorithms mentioned above. The experimental results show that the content-feature-based algorithm identifies the new type of network water army with low accuracy; the user-feature-based algorithm has higher accuracy but a lower recall rate when processing large volumes of data; and the detection method of the invention (DCA) has good applicability and higher accuracy in identifying the new type of network water army.
Brief description of the drawings
Fig. 1 is a flow chart of the microblog water army detection method based on artificial immunity danger theory provided by an embodiment of the invention.
Fig. 2 is a schematic diagram of the microblog water army detection system based on artificial immunity danger theory provided by an embodiment of the invention.
In the figure: 1, microblog data acquisition module; 2, feature selection module; 3, antigen signal definition module; 4, DCA-based microblog water army detection module.
Fig. 3 is a comparison of the experimental results of the microblog water army detection method based on artificial immunity danger theory provided by an embodiment of the invention and other water army detection methods.
Detailed description of the embodiments
In order to make the objects, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
DT (Danger Theory): danger theory, one of the research theories of artificial immune systems; DCA (Dendritic Cell Algorithm): dendritic cell algorithm; PAMP (Pathogen-Associated Molecular Patterns): pathogen-associated molecular pattern; DS (Danger Signal): danger signal; SS (Safe Signal): safe signal; IS (Inflammatory Signal): inflammatory signal; CSM (costimulatory molecules): costimulatory molecules.
DCA principle: the DCA was proposed mainly by simulating the function of dendritic cells as antigen-presenting cells. It has four kinds of input signals: (1) the PAMP signal (pathogen-associated molecular pattern); (2) the DS signal (danger signal); (3) the SS signal (safe signal), the signal produced by natural cell death, which represents normal behavior in the system; (4) the IS signal (inflammatory signal). After the input signals are fused through the correlation function and the weight matrix, three signals are output: (1) CSM (costimulatory molecules): this value determines when an immature DC begins to differentiate; when CSM exceeds the migration threshold, the immature DC begins to differentiate into a semi-mature or mature DC; (2) semi-mature DC (semi-mature): indicates how safe the current cell environment is, and all antigens collected by the DC are presented as safe antigens; (3) mature DC (mature): indicates how dangerous the current cell environment is, and all antigens collected by the DC are presented as dangerous antigens. When an antigen reaches the discrimination count, the mature context antigen value MCAV, which represents the antigen's degree of abnormality, is calculated (MCAV = number of times the antigen is labeled dangerous / total number of times the antigen is labeled).
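The count-based MCAV in the last sentence can be sketched in Python as follows. The function name, the label strings, and the sample presentations are assumptions for illustration; each pair records one presentation of an antigen (a user) by a migrated DC together with that DC's context.

```python
from collections import Counter


def mcav_by_counts(presentations):
    """Compute MCAV per antigen from (antigen_id, context) pairs.

    context is "mature" (dangerous) or "semi-mature" (safe).
    MCAV = times labeled dangerous / total times labeled.
    """
    total = Counter()
    dangerous = Counter()
    for antigen_id, context in presentations:
        total[antigen_id] += 1
        if context == "mature":
            dangerous[antigen_id] += 1
    return {antigen: dangerous[antigen] / total[antigen] for antigen in total}


# Made-up presentations for two users.
scores = mcav_by_counts([
    ("user_a", "mature"),
    ("user_a", "mature"),
    ("user_a", "semi-mature"),
    ("user_b", "semi-mature"),
])
```

Here user_a is presented three times and labeled dangerous twice, giving an MCAV of 2/3, while user_b is never labeled dangerous and scores 0.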
The invention is further described below with reference to a worked example.
The microblog water army detection method based on artificial immunity danger theory provided by an embodiment of the invention comprises:
Step 1, acquisition of microblog data
The invention uses a focused web crawler to crawl microblog user information. The crawling process mainly includes simulated login, obtaining user address links, and HTML code parsing.
(1) Simulated login: "https://login.sina.com.cn/signup/signin.php" is chosen as the simulated login address. After authentication at this address succeeds, the resulting cookie is used to authenticate with weibo.com or weibo.cn and obtain the cookie for weibo.com or weibo.cn, establishing a session. The session mechanism is implemented through cookies and URL rewriting, which completes the login.
(2) Obtaining user address links: Sina Weibo divides users by verification type into ordinary users (without Sina verification), individually verified users (marked with a yellow or gold V, etc.), and verified enterprises and organizations (marked with a blue V), among others. The home pages and second-level pages of users with different verification types have different URL link templates. To protect user privacy, "***" below marks the position of a user's UID or sensitive data. Examples follow.
Https: //weibo.com/u/***: unverified and personal authentication individual subscriber homepage link;
Https: //weibo.com/***: having authenticated the individual subscriber homepage link of enterprise, group, mechanism;
Https: //weibo.com/p/xxxxx***/info? mod=pedit_more: the message details of all types of user Page link, wherein " xxxxx " is the character string of 5 0-9, different user has different character strings.It is captured by Fiddler Ajax request is to the path for storing this page, by analysis it is found that the path string in link is that 5 bit digitals add user UID composition, therefore can be linked according to this rule creation user's details page.
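The three URL templates above can be sketched as small helper functions. The function names and the example UID and prefix used with them are hypothetical; only the URL patterns themselves come from the text.

```python
def profile_url(uid: str, organization: bool = False) -> str:
    """Home page link: /u/<uid> for individuals, /<uid> for organizations."""
    base = "https://weibo.com/"
    return base + uid if organization else base + "u/" + uid


def info_url(prefix: str, uid: str) -> str:
    """Details page link: five digits followed by the UID, as described above."""
    if len(prefix) != 5 or not (prefix.isascii() and prefix.isdigit()):
        raise ValueError("prefix must be exactly five digits")
    return f"https://weibo.com/p/{prefix}{uid}/info?mod=pedit_more"
```

For a made-up UID such as "1234567890" and a made-up prefix "10050", `info_url("10050", "1234567890")` yields a link of the form shown in the third template.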
(3) HTML code parsing: after the login and the target URL have been set up, the crawler saves the HTML at that link. The content of the HTML file is complex, so the required information has to be located by keyword filtering. The urllib and urllib2 libraries that ship with Python can be used to perform various parsing operations on the HTML of a URL, or Scrapy, an advanced Python crawler framework, can be used to locate information in the HTML page. Scrapy can be adapted to the user's needs and captures web page information quickly and accurately.
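Keyword-based positioning can be sketched as follows. This sketch uses the `html.parser` module from the Python 3 standard library rather than urllib2 or Scrapy, and the sample markup and keyword names are hypothetical; the real page structure is not given in the text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text fragments of a page in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_fields(html, keywords):
    """Map each keyword found in the page to the text fragment that follows it."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    fields = {}
    for i, chunk in enumerate(parser.chunks[:-1]):
        if chunk in keywords:
            fields[chunk] = parser.chunks[i + 1]
    return fields


# Hypothetical markup, used only to exercise the function.
SAMPLE = (
    "<ul><li><span>Followers</span><span>120</span></li>"
    "<li><span>Following</span><span>80</span></li></ul>"
)
```

Calling `extract_fields(SAMPLE, {"Followers", "Following"})` pairs each label with the value that follows it in the page.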
Step 2, feature selection
After extracting 14 user behavior features (follower count, following count, total number of microblogs, number of original microblogs, whether verified, microblog level, whether a profile introduction exists, registration time, Sunshine Credit, mutual-follow count, number of topics participated in, comment count, repost count and like count), the invention fuses the 14 original features, through repeated comparative experiments and summarization, into six indicators: Sunshine Credit, activity, identity evaluation, influence, follower-to-following ratio and original microblog ratio. The fusion process is as follows:
(1) Sunshine Credit (SC) is divided into five levels: very low (300-419), low (420-450), average (451-570), good (571-690) and excellent (691-900), which are represented by the values 1-5 respectively in the fusion;
(2) Activity (AT) involves several attribute variables: total number of microblogs (M), number of topics participated in (T), registration time (Z) and current time (N), where "N-Z" is measured in days. It is calculated as follows,
AT=(0.7M+0.3T)/(N-Z);
(3) Identity evaluation (IE) involves several attribute variables: whether a profile introduction exists (I), whether the user is verified (C) and the level (G). The weights of these attributes are 0.2, 0.4 and 0.4 respectively. It is calculated as follows,
IE=0.2I+0.4C+0.4G;
(4) Influence (CI) involves several attribute variables: comment count (J), repost count (R) and like count (F), i.e. the numbers of times the user's microblogs have been commented on, reposted and liked. The weights of these attributes are 0.3, 0.5 and 0.2 respectively. It is calculated as follows,
CI=0.3J+0.5R+0.2F;
(5) Follower-to-following ratio (FF) is the ratio of each user's follower count (Fans) to the number of accounts the user follows (Followers). It is calculated as follows,
FF=Fans/Followers;
(6) Original microblog ratio (OM) is the proportion of original microblogs (Weibo_Original) in the total number of microblogs (M) sent by the user. It is calculated as follows,
OM=(Weibo_Original)/M;
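The six fusion formulas above can be written out as a short sketch. The record type `RawProfile`, its field names and the guards against division by zero are assumptions made for illustration, as is mapping credit scores below 300 to level 1; the weights and score bands are those stated in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RawProfile:
    sunshine_credit: int   # raw Sunshine Credit score
    posts: int             # M, total number of microblogs
    topics: int            # T, topics participated in
    registered: date       # Z, registration date
    has_bio: bool          # I, profile introduction present
    verified: bool         # C, verified or not
    level: int             # G, microblog level
    comments: int          # J
    reposts: int           # R
    likes: int             # F
    followers: int         # "Fans" in the text
    following: int         # "Followers" in the text (accounts the user follows)
    original_posts: int    # Weibo_Original


def credit_grade(score: int) -> int:
    """Map a Sunshine Credit score to the levels 1-5 listed in the text."""
    for upper, grade in ((419, 1), (450, 2), (570, 3), (690, 4)):
        if score <= upper:
            return grade
    return 5


def fuse_indicators(p: RawProfile, today: date) -> dict:
    # Assumption: an account registered today counts as one day old.
    days = max((today - p.registered).days, 1)
    return {
        "SC": credit_grade(p.sunshine_credit),
        "AT": (0.7 * p.posts + 0.3 * p.topics) / days,
        "IE": 0.2 * int(p.has_bio) + 0.4 * int(p.verified) + 0.4 * p.level,
        "CI": 0.3 * p.comments + 0.5 * p.reposts + 0.2 * p.likes,
        # Assumption: fall back to the raw count when nothing is followed.
        "FF": p.followers / p.following if p.following else float(p.followers),
        "OM": p.original_posts / p.posts if p.posts else 0.0,
    }
```

The returned dictionary uses the same abbreviations (SC, AT, IE, CI, FF, OM) as the text, so it can be passed directly to the normalization step.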
Step 3, definition of antigen signals
The six indicators, Sunshine Credit (SC), activity (AT), identity evaluation (IE), influence (CI), follower-to-following ratio (FF) and original microblog ratio (OM), are normalized. The mapping function is as follows:
where x is the original signal value; when x ∈ [m, n], a linear mapping is applied, and when x ∈ [n, ∞), the signal takes the maximum value 10.
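A minimal sketch of such a mapping is given below. The text only states that values in [m, n] are mapped linearly and that values from n upward take the maximum 10; that the linear segment runs from 0 to 10, and that values below m map to 0, are assumptions of this sketch.

```python
def normalize(x: float, m: float, n: float, top: float = 10.0) -> float:
    """Map x to [0, top]: linear on [m, n], saturating at top for x >= n."""
    if n <= m:
        raise ValueError("n must be greater than m")
    if x >= n:
        return top
    if x <= m:
        return 0.0  # assumption: the text does not say how values below m are handled
    return top * (x - m) / (n - m)
```

With m = 0 and n = 1, for example, an original microblog ratio of 0.25 would map to 2.5.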
The indicators used to detect Sina Weibo water army accounts are mapped to the four DCA input signals as follows:
Pathogen-associated molecular pattern PAMP: indicates that the user's behavior is abnormal and shows features of water army behavior; defined as PAMP={<SC,IE,FF>};
Danger signal DS: indicates that the user's behavior is more likely to be abnormal; this may merely be a change in normal behavior, but water army behavior is possible; defined as DS={<AT,CI,OM>};
Safe signal SS: indicates that the user is more likely to be normal and in a normal state; defined as SS={<SC,IE>};
Inflammatory signal IS: indicates that anomalies are widespread among current users; it amplifies the PAMP, DS and SS signals; defined as IS={<CI>}.
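The mapping above can be expressed as a lookup table. The sketch below only assembles the indicator tuples for each signal; how each tuple is reduced to a single signal concentration is not stated in the text, so no aggregation is attempted. The names `SIGNAL_INDICATORS` and `build_signals` are hypothetical.

```python
# Signal -> indicators, exactly as defined in the text.
SIGNAL_INDICATORS = {
    "PAMP": ("SC", "IE", "FF"),
    "DS": ("AT", "CI", "OM"),
    "SS": ("SC", "IE"),
    "IS": ("CI",),
}


def build_signals(indicators: dict) -> dict:
    """Return, for each DCA input signal, the tuple of its indicator values."""
    return {
        signal: tuple(indicators[name] for name in names)
        for signal, names in SIGNAL_INDICATORS.items()
    }
```

The input is expected to hold the normalized values of the six indicators keyed by their abbreviations.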
Step 4, microblog water army detection based on the DCA
The DCA flow applied in the invention is shown in Fig. 1.
Specific implementation: microblog users are treated as antigens. First, the antigen sampling number and the dendritic cell population are initialized. An unidentified microblog user is then selected at random from the microblog user detection sample, and that user's pathogen-associated molecular pattern signal, danger signal, safe signal and inflammatory signal are taken as input signals. The concentrations of CSM, SEM and MAT are calculated from the formula below and its corresponding weight matrix, and the CSM, SEM and MAT concentrations obtained by the DCs presenting the same antigen are accumulated.
The DCA calculation formula is as follows:
The signal weight matrix of the DCA is as follows:
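The formula and weight matrix are not reproduced in this text. As a sketch only, the signal fusion can be illustrated with the weighted-sum form commonly quoted in the DCA literature, C = (WP·CP + WD·CD + WS·CS) / (|WP| + |WD| + |WS|) × (1 + IS) / 2. Both this exact form and the weight values below are assumptions, not the values of the invention.

```python
# Hypothetical weights per output signal, in the order (PAMP, DS, SS).
WEIGHTS = {
    "CSM": (2.0, 1.0, 2.0),
    "SEM": (0.0, 0.0, 3.0),
    "MAT": (2.0, 1.0, -3.0),
}


def fuse_signals(c_p, c_d, c_s, inflammation, weights=WEIGHTS):
    """Return the CSM, SEM and MAT concentrations for one signal sample."""
    outputs = {}
    for name, (w_p, w_d, w_s) in weights.items():
        weighted = w_p * c_p + w_d * c_d + w_s * c_s
        norm = abs(w_p) + abs(w_d) + abs(w_s)
        # (1 + IS) acts as the amplification term described in the claims;
        # the division by 2 is part of the assumed literature form.
        outputs[name] = weighted / norm * (1 + inflammation) / 2
    return outputs
```

With these placeholder weights, a high safe signal raises SEM and lowers MAT, while high PAMP and danger signals raise MAT, which matches the qualitative roles described for the signals.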
If CSM exceeds the migration threshold, SEM and MAT are compared, and the state of the DC and the state of the antigens it collected are labeled according to the result. Once the total number of determinations for an antigen reaches the antigen discrimination threshold, the mature context antigen value (MCAV) is calculated as MCAV=MAT/(SEM+MAT). MCAV is then compared with the anomaly threshold: if MCAV is larger, the antigen is labeled abnormal, i.e. the microblog user is a water army account; otherwise it is labeled normal.
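The decision loop can be illustrated with a deliberately simplified sketch in which each DC handles a single antigen (user) at a time and MAT and SEM in the MCAV formula are read as the numbers of mature and semi-mature determinations. The threshold values, the function names and the per-observation signal triples are hypothetical, and this is not the full population-based algorithm.

```python
import random
from itertools import cycle


def mcav_for_user(observations, migration_threshold=10.0, decisions_needed=5):
    """observations: list of (csm, sem, mat) contributions for one user."""
    if not observations:
        return 0.0
    mature = semi = steps = 0
    csm = sem = mat = 0.0
    stream = cycle(observations)
    limit = 1000 * decisions_needed  # guard against signals that never migrate
    while mature + semi < decisions_needed and steps < limit:
        o_csm, o_sem, o_mat = next(stream)
        csm, sem, mat = csm + o_csm, sem + o_sem, mat + o_mat
        steps += 1
        if csm > migration_threshold:   # the DC migrates and presents its context
            if mat > sem:
                mature += 1
            else:
                semi += 1
            csm = sem = mat = 0.0       # a fresh DC replaces the migrated one
    decided = mature + semi
    return mature / decided if decided else 0.0


def detect(samples, anomaly_threshold=0.5, seed=0):
    """samples: user -> observations. Returns user -> label."""
    pending = list(samples)
    random.Random(seed).shuffle(pending)  # users are picked in random order
    return {
        user: "water army" if mcav_for_user(samples[user]) > anomaly_threshold else "normal"
        for user in pending
    }
```

Because each user is evaluated independently here, the random order does not affect the labels; in a population-based DCA, where one DC samples several antigens, it would.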
As shown in Fig. 2, an embodiment of the invention provides a microblog water army detection system based on artificial immune danger theory, comprising:
a microblog data acquisition module 1, which crawls microblog user information with a focused web crawler;
a feature selection module 2, which, after extracting 14 user behavior features (follower count, following count, total number of microblogs, number of original microblogs, whether verified, microblog level, whether a profile introduction exists, registration time, Sunshine Credit, mutual-follow count, number of topics participated in, comment count, repost count and like count), fuses the 14 original features, through repeated comparative experiments and summarization, into six indicators: Sunshine Credit, activity, identity evaluation, influence, follower-to-following ratio and original microblog ratio;
an antigen signal definition module 3, which normalizes the six indicators Sunshine Credit SC, activity AT, identity evaluation IE, influence CI, follower-to-following ratio FF and original microblog ratio OM;
a DCA-based microblog water army detection module 4, which treats microblog users as antigens and first initializes the antigen sampling number and the dendritic cell population; selects an unidentified microblog user at random from the microblog user detection sample and takes that user's pathogen-associated molecular pattern signal, danger signal, safe signal and inflammatory signal as input signals; calculates the concentrations of CSM, SEM and MAT from the DCA calculation formula and its corresponding weight matrix, and accumulates the CSM, SEM and MAT concentrations obtained by the DCs presenting the same antigen; if CSM exceeds the migration threshold, compares SEM and MAT and labels the state of the DC and the state of the antigens it collected according to the result; and, once the total number of determinations for an antigen reaches the antigen discrimination threshold, calculates the mature context antigen value MCAV and compares it with the anomaly threshold: if MCAV is larger, the antigen is labeled abnormal and the microblog user is a water army account; otherwise it is labeled normal.
The invention is further described below with reference to experimental results.
As shown in Fig. 3, the invention obtains user data through the microblog data crawling module and detects the user data with the DCA-based water army detection module. The evaluation indicators of the experiment are precision (PR), recall (RR) and their harmonic mean F1; the higher the three indicators, the better the water army detection.
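The three indicators can be computed as in the following sketch, assuming that PR denotes precision and RR denotes recall in the usual sense, with water army accounts as the positive class. The function name and the sample labels are illustrative only.

```python
def precision_recall_f1(predicted, actual):
    """predicted, actual: sequences of booleans, True meaning water army."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.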
The invention compares the DCA-based water army detection module with traditional water army detection methods based on content features and on user features, comparing precision, recall and the harmonic mean across the experimental results. The validity of the algorithm was verified, and the comparative experiments were run, on real Sina Weibo user data. The experimental results show that the method of the invention can effectively detect water army users on Sina Weibo and has a higher detection accuracy.
The above embodiments may be implemented wholly or partly in software, hardware, firmware or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, hard disk or magnetic tape), an optical medium (for example a DVD), a semiconductor medium (for example a solid state disk (SSD)), and so on.
The above are merely preferred embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A microblog water army detection method based on artificial immune danger theory, characterized in that the microblog water army detection method based on artificial immune danger theory comprises:
obtaining microblog user behavior data with a focused web crawler, and applying artificial immunity to the detection of microblog user behavior features;
analyzing user behavior features to characterize and define water army behavior, and distinguishing the feature attributes of new-type water army accounts from those of normal users;
detecting water army user behavior in microblogs with the dendritic cell algorithm (DCA) of artificial immune danger theory.
2. The microblog water army detection method based on artificial immune danger theory according to claim 1, characterized in that the method specifically comprises:
Step 1, acquisition of microblog data: crawling microblog user information with a focused web crawler;
Step 2, feature selection: after extracting 14 user behavior features (follower count, following count, total number of microblogs, number of original microblogs, whether verified, microblog level, whether a profile introduction exists, registration time, Sunshine Credit, mutual-follow count, number of topics participated in, comment count, repost count and like count), fusing the 14 original features, through repeated comparative experiments and summarization, into six indicators: Sunshine Credit, activity, identity evaluation, influence, follower-to-following ratio and original microblog ratio;
Step 3, definition of antigen signals: normalizing the six indicators Sunshine Credit SC, activity AT, identity evaluation IE, influence CI, follower-to-following ratio FF and original microblog ratio OM, with the mapping function as follows: where x is the original signal value; when x ∈ [m, n], a linear mapping is applied, and when x ∈ [n, ∞), the signal takes the maximum value 10;
Step 4, microblog water army detection based on the DCA: treating microblog users as antigens and first initializing the antigen sampling number and the dendritic cell population; selecting an unidentified microblog user at random from the microblog user detection sample, and taking that user's pathogen-associated molecular pattern signal, danger signal, safe signal and inflammatory signal as input signals;
calculating the concentrations of CSM, SEM and MAT from the formula below and its corresponding weight matrix, and accumulating the CSM, SEM and MAT concentrations obtained by the DCs presenting the same antigen;
The DCA calculation formula is as follows:
where (1+IS) is the amplification signal; the values and weights of the input signals PAMP, DS and SS are CP, CD, CS and WP, WD, WS respectively; and the values of the output signals CSM, SEM and MAT are C[CSM], C[SEM] and C[MAT] respectively;
the CSM, SEM and MAT values are calculated from the input signal values and the weight matrix and accumulated. If CSM exceeds the migration threshold, SEM and MAT are compared, and the state of the DC and the state of the antigens it collected are labeled according to the result. Once the total number of determinations for an antigen reaches the antigen discrimination threshold, the mature context antigen value MCAV is calculated as MCAV=MAT/(SEM+MAT), where SEM and MAT are the values of the output signals SEM and MAT; MCAV is compared with the anomaly threshold: if MCAV is larger, the antigen is labeled abnormal and the microblog user is a water army account; otherwise it is labeled normal.
3. The microblog water army detection method based on artificial immune danger theory according to claim 2, characterized in that, in Step 1, the crawling method comprises simulated login, obtaining user address links, and HTML code parsing;
(1) simulated login: logging in after authentication at the login address succeeds;
(2) obtaining user address links: Sina Weibo divides users by verification type into ordinary users not verified by Sina, individually verified users marked with a yellow or gold V, and organization-verified users such as enterprises and institutions marked with a blue V; the home pages and second-level pages of the different verification types use different URL templates;
(3) HTML code parsing: after the login and the target URL have been set up, performing various parsing operations on the HTML of the URL with the urllib and urllib2 libraries that ship with Python, or locating information in the HTML page with Scrapy, an advanced Python crawler framework; and capturing the web page information.
4. The microblog water army detection method based on artificial immune danger theory according to claim 2, characterized in that, in Step 2, the fusion method comprises:
1) Sunshine Credit SC is divided into the levels very low 300-419, low 420-450, average 451-570, good 571-690 and excellent 691-900, which are represented by the values 1-5 respectively in the fusion;
2) activity AT involves the total number of microblogs M, the number of topics participated in T, the registration time Z and the current time N, where "N-Z" is measured in days, and is calculated as follows,
AT=(0.7M+0.3T)/(N-Z);
3) identity evaluation IE involves whether a profile introduction exists I, whether the user is verified C and the level G, with weights 0.2, 0.4 and 0.4 respectively, and is calculated as follows,
IE=0.2I+0.4C+0.4G;
4) influence CI involves the comment count J, the repost count R and the like count F, i.e. the numbers of times the user's microblogs have been commented on, reposted and liked, with weights 0.3, 0.5 and 0.2 respectively, and is calculated as follows,
CI=0.3J+0.5R+0.2F;
5) follower-to-following ratio FF is the ratio of each user's follower count Fans to the number of accounts the user follows Followers, and is calculated as follows,
FF=Fans/Followers;
6) original microblog ratio OM is the proportion of original microblogs Weibo_Original in the total number of microblogs M sent by the user, and is calculated as follows, OM=(Weibo_Original)/M.
5. The microblog water army detection method based on artificial immune danger theory according to claim 2, characterized in that, in Step 3, the mapping between the indicators used to detect Sina Weibo water army accounts and the four DCA input signals comprises:
pathogen-associated molecular pattern PAMP: indicates that the user's behavior is abnormal and shows features of water army behavior; defined as PAMP={<SC,IE,FF>};
danger signal DS: indicates that the user's behavior is more likely to be abnormal; this may merely be a change in normal behavior, but water army behavior is possible; defined as DS={<AT,CI,OM>};
safe signal SS: indicates that the user is more likely to be normal and in a normal state; defined as SS={<SC,IE>};
inflammatory signal IS: indicates that anomalies are widespread among current users; it amplifies the PAMP, DS and SS signals; defined as IS={<CI>}.
6. A computer program, characterized in that the computer program runs the microblog water army detection method based on artificial immune danger theory according to any one of claims 1 to 5.
7. A terminal, characterized in that the terminal carries at least a controller implementing the microblog water army detection method based on artificial immune danger theory according to any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the microblog water army detection method based on artificial immune danger theory according to any one of claims 1 to 5.
9. A microblog water army detection system based on artificial immune danger theory for implementing the microblog water army detection method based on artificial immune danger theory according to claim 1, characterized in that the system comprises:
a microblog data acquisition module, which crawls microblog user information with a focused web crawler;
a feature selection module, which, after extracting 14 user behavior features (follower count, following count, total number of microblogs, number of original microblogs, whether verified, microblog level, whether a profile introduction exists, registration time, Sunshine Credit, mutual-follow count, number of topics participated in, comment count, repost count and like count), fuses the 14 original features, through repeated comparative experiments and summarization, into six indicators: Sunshine Credit, activity, identity evaluation, influence, follower-to-following ratio and original microblog ratio;
an antigen signal definition module, which normalizes the six indicators Sunshine Credit SC, activity AT, identity evaluation IE, influence CI, follower-to-following ratio FF and original microblog ratio OM;
a DCA-based microblog water army detection module, which treats microblog users as antigens and first initializes the antigen sampling number and the dendritic cell population; selects an unidentified microblog user at random from the microblog user detection sample and takes that user's pathogen-associated molecular pattern signal, danger signal, safe signal and inflammatory signal as input signals; calculates the concentrations of CSM, SEM and MAT from the DCA calculation formula and its corresponding weight matrix, and accumulates the CSM, SEM and MAT concentrations obtained by the DCs presenting the same antigen; if CSM exceeds the migration threshold, compares SEM and MAT and labels the state of the DC and the state of the antigens it collected according to the result; and, once the total number of determinations for an antigen reaches the antigen discrimination threshold, calculates the mature context antigen value MCAV and compares it with the anomaly threshold: if MCAV is larger, the antigen is labeled abnormal and the microblog user is a water army account; otherwise it is labeled normal.
10. A microblog network platform, characterized in that the microblog network platform carries at least the microblog water army detection system based on artificial immune danger theory according to claim 9.
CN201810950560.1A 2018-08-20 2018-08-20 Microblog water army detection method and detection system based on artificial immune hazard theory Active CN109558555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810950560.1A CN109558555B (en) 2018-08-20 2018-08-20 Microblog water army detection method and detection system based on artificial immune hazard theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810950560.1A CN109558555B (en) 2018-08-20 2018-08-20 Microblog water army detection method and detection system based on artificial immune hazard theory

Publications (2)

Publication Number Publication Date
CN109558555A true CN109558555A (en) 2019-04-02
CN109558555B CN109558555B (en) 2020-05-05

Family

ID=65864492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810950560.1A Active CN109558555B (en) 2018-08-20 2018-08-20 Microblog water army detection method and detection system based on artificial immune hazard theory

Country Status (1)

Country Link
CN (1) CN109558555B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287322A (en) * 2019-06-27 2019-09-27 有米科技股份有限公司 Moisture flow processing method, system and the equipment of social media flow
CN110297990A (en) * 2019-05-23 2019-10-01 东南大学 The associated detecting method and system of crowdsourcing marketing microblogging and waterborne troops
CN111159399A (en) * 2019-12-13 2020-05-15 天津大学 Automobile vertical website water army discrimination method
CN113806616A (en) * 2021-08-16 2021-12-17 北京智慧星光信息技术有限公司 Microblog user identification method, system, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077240A (en) * 2013-01-10 2013-05-01 北京工商大学 Microblog water army identifying method based on probabilistic graphical model
CN103198161A (en) * 2013-04-28 2013-07-10 中国科学院计算技术研究所 Microblog ghostwriter identifying method and device
CN106940732A (en) * 2016-05-30 2017-07-11 国家计算机网络与信息安全管理中心 A kind of doubtful waterborne troops towards microblogging finds method
US20180083903A1 (en) * 2016-09-21 2018-03-22 King Fahd University Of Petroleum And Minerals Spam filtering in multimodal mobile communication
CN107895010A (en) * 2017-11-13 2018-04-10 华东师范大学 A kind of method that detection network navy is thumbed up based on network
CN108197696A (en) * 2018-01-31 2018-06-22 湖北工业大学 A kind of network navy account recognition methods and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
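张超 et al.: "Detection of bulk spam e-mail based on the dendritic cell algorithm", 《传感器与微系统》 (Transducer and Microsystem Technologies) *
杨超 et al.: "Research on microblog water army user detection based on artificial immune danger theory", 《计算机科学》 (Computer Science) *
王志召: "Design and implementation of a microblog data analysis system", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *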

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297990A (en) * 2019-05-23 2019-10-01 东南大学 The associated detecting method and system of crowdsourcing marketing microblogging and waterborne troops
CN110287322A (en) * 2019-06-27 2019-09-27 有米科技股份有限公司 Moisture flow processing method, system and the equipment of social media flow
CN110287322B (en) * 2019-06-27 2021-04-16 有米科技股份有限公司 Water flow processing method, system and equipment for social media flow
CN111159399A (en) * 2019-12-13 2020-05-15 天津大学 Automobile vertical website water army discrimination method
CN113806616A (en) * 2021-08-16 2021-12-17 北京智慧星光信息技术有限公司 Microblog user identification method, system, electronic equipment and storage medium
CN113806616B (en) * 2021-08-16 2023-08-22 北京智慧星光信息技术有限公司 Microblog user identification method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109558555B (en) 2020-05-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant