CN106255116A

CN106255116A - A method for identifying harassing numbers

Info

Publication number: CN106255116A
Application number: CN201610710545.0A
Authority: CN
Inventors: 王瀚辰; 王彦青
Original assignee: Individual
Current assignee: Individual
Priority date: 2016-08-24
Filing date: 2016-08-24
Publication date: 2016-12-21

Abstract

A method for identifying harassing numbers, comprising: selecting a number of confirmed harassing and non-harassing numbers and calculating their communication behavior indicators over a period of time; forming a training sample set from these numbers and their communication behavior indicators and building a random forest classification model, whose input is the communication behavior indicators of each subscriber number and whose output is the prediction probability with which all decision trees judge the number to be a harassing or a non-harassing number; and inputting the communication behavior indicators of a number to be identified over a period of time into the random forest classification model, calculating the prediction probability with which all decision trees in the model judge it to be a harassing or a non-harassing number, and deciding on that basis whether the number to be identified is a harassing number. The invention belongs to the field of network communication technology. It makes full use of the call characteristics of calling and called numbers and effectively identifies harassing numbers from the massive traffic data of the live network.

Description

A method for identifying harassing numbers

Technical field

The present invention relates to a method for identifying harassing numbers and belongs to the field of network communication technology.

Background art

Harassing calls, which push advertisements and fraudulent information, have become an illegal trade that disturbs public peace. Comprehensive analysis shows that harassing calls generally have the following characteristics:

1. Dispersed callees: a harassing number calls many numbers per unit of time at a high frequency, and the called numbers are only weakly related to one another;

2. The harassing caller and its callees are usually only weakly related, i.e. they have little call history, and the number of calls a harassing number initiates as caller is usually far greater than the number it receives as callee;

3. Harassing calls are short, and the callee is more likely to be the party that hangs up;

4. Harassing calls are generally made at a high frequency and concentrated in certain time periods.

Patent application CN200910079707.5 (title: Method and device for identifying harassing calls; filing date: 2009-03-06; applicant: ZTE Co., Ltd) discloses a method and device for identifying harassing calls. It adds processing of unfamiliar telephone numbers to the mobile phone: statistics on the call interval, call duration and number of incoming calls of an unfamiliar calling number are automatically compared with the user's judgment rules to identify harassing calls. This scheme identifies harassing numbers only from statistics on call interval, call duration and number of incoming calls. The judgment method is very simple, and it does not make full use of the call characteristics of calling and called numbers to identify harassing numbers effectively from the massive traffic data of the live network.

Therefore, how to make full use of the call characteristics of calling and called numbers so as to identify harassing numbers effectively from the massive traffic data of the live network remains a technical problem worth further study.

Summary of the invention

In view of this, the object of the present invention is to provide a method for identifying harassing numbers that makes full use of the call characteristics of calling and called numbers and effectively identifies harassing numbers from the massive traffic data of the live network.

To achieve the above object, the present invention provides a method for identifying harassing numbers, comprising:

Step one: select a number of confirmed harassing and non-harassing numbers, calculate the communication behavior indicators of these numbers over a period of time, then form a training sample set from the numbers and their communication behavior indicators and build a random forest classification model. The input of the model is the communication behavior indicators of each subscriber number, and the output is the prediction probability with which all decision trees judge the number to be a harassing or a non-harassing number;

Step two: input the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, calculate the prediction probability with which all decision trees in the model judge it to be a harassing or a non-harassing number, and decide on that basis whether the number to be identified is a harassing number.

Compared with the prior art, the beneficial effects of the present invention are as follows. Communication behavior indicators such as callee dispersion, the called-relationship loop, the outgoing/incoming call ratio and the distribution of call times effectively reflect the behavior of harassing numbers. The invention uses a random forest classification model that takes multiple communication behavior indicators as input, such as call frequency, number of called numbers, call duration, ring duration, number of active releases, number of passive releases, callee dispersion, the correlation coefficient among the called numbers of the same caller, the maximum frequency of calls to the same ten-thousand-number segment, caller ratio and the standard deviation of call intervals, and identifies harassing numbers from the probabilities with which all decision trees judge a number to be harassing or non-harassing. It can therefore exploit the call characteristics of calling and called numbers, fully mine the data features in a large number of training samples, and effectively identify harassing numbers from the massive traffic data of the live network; the communication behavior indicators can also be adjusted flexibly as needed. Because harassing calls are made at a high frequency and concentrated in certain time periods, the invention further divides a whole day's call detail records into communication periods whose lengths are given by multiple time granularities, and calculates the communication behavior indicators of a subscriber number from the high-frequency communication period under each time granularity, which further improves the near-real-time performance and efficiency of harassing-number identification. The invention can also build multiple random forest classification models and select an optimal one according to the recognition rate each model achieves in testing.

Brief description of the drawings

Fig. 1 is a flowchart of the method for identifying harassing numbers of the present invention.

Fig. 2 is a detailed flowchart of step A.

Fig. 3 is a detailed flowchart of the generation process of the k-th decision tree (k = 1, 2, ..., K) in step 11.

Fig. 4 is a detailed flowchart of step two of Fig. 1.

Fig. 5 is a detailed flowchart of building a test sample set, testing multiple random forest classification models on it, and selecting an optimal random forest classification model according to the test results.

Detailed description of the invention

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the method for identifying harassing numbers of the present invention comprises:

In step one, confirmed harassing and non-harassing numbers can be selected from a blacklist of historically confirmed harassing numbers (for example, harassing numbers obtained from Internet companies or marked by the operator's reporting and complaint system) and from a whitelist. Call-event signalling call detail records are then collected by signalling acquisition from equipment such as a signalling monitoring system or the A interface, or historical call detail records are collected from the BOSS, to obtain the communication information of the selected numbers over a period of time. Communication records in which a key field is malformed or has a missing value are discarded.
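The record-cleaning rule at the end of step one can be sketched as follows. This is a minimal illustration, not the patent's implementation: the field names in `KEY_FIELDS`, the timestamp format and the helper names are all hypothetical, since the text does not name the key fields of a call detail record.

```python
from datetime import datetime

# Hypothetical key fields; the source does not list the actual CDR columns.
KEY_FIELDS = ("caller", "callee", "start_time", "duration_s")

def is_valid_record(rec):
    """Return True if every key field is present and well-formed."""
    for field in KEY_FIELDS:
        if rec.get(field) in (None, ""):
            return False          # missing value in a key field
    if not (str(rec["caller"]).isdigit() and str(rec["callee"]).isdigit()):
        return False              # malformed number
    try:
        # Assumed timestamp layout, chosen only for this example.
        datetime.strptime(rec["start_time"], "%Y-%m-%d %H:%M:%S")
        return float(rec["duration_s"]) >= 0
    except (TypeError, ValueError):
        return False              # malformed timestamp or duration

def clean_records(records):
    """Keep only records whose key fields pass validation."""
    return [r for r in records if is_valid_record(r)]

sample = [
    {"caller": "13900000001", "callee": "13900000002",
     "start_time": "2016-08-01 09:15:00", "duration_s": "12"},
    {"caller": "13900000001", "callee": "",
     "start_time": "2016-08-01 09:16:00", "duration_s": "3"},
    {"caller": "13900000001", "callee": "13900000003",
     "start_time": "08/01/2016", "duration_s": "5"},
]
cleaned = clean_records(sample)
```

Only the first sample record survives: the second has an empty callee and the third has a timestamp in the wrong format.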

Because harassing calls are made at a high frequency and concentrated in certain time periods, and in order to further improve the near-real-time performance and efficiency of harassing-number identification, the invention may also calculate the communication behavior indicators of a subscriber number separately under different time granularities. The time granularity may take, without limitation, the following values: 1, 5, 15, 30, 60, 180, 360, 720 and 1440 minutes. The communication behavior indicators of a subscriber number over a period of time may thus include the user's communication behavior indicators under each time granularity, the subscriber number and a user identifier, where the user identifier marks whether the subscriber number is a harassing number (for example, a number selected from the blacklist is marked as a harassing number and a number selected from the whitelist as a non-harassing number). In the present invention, calculating the communication behavior indicators of a subscriber number (harassing and non-harassing numbers as well as numbers to be identified) over a period of time on the basis of time granularity may further comprise:

Step A: collect the user's historical call detail records for several consecutive days, select multiple time granularities, find the user's high-frequency call period under each time granularity, and finally calculate the user's communication behavior indicators under each time granularity from the communication behavior indicators within the high-frequency call period under that granularity.

As shown in Fig. 2, step A may further comprise:

Step A1: extract the user's call detail records day by day;

Step A2: read the start and end times of that day's call detail records and determine the maximum time granularity Tmax covered by them, i.e. select, from the chosen time granularities, the largest one whose value is less than the duration between the start and end times;

If the collected call detail records are incomplete because of loss or for other reasons, for example if only the historical records between 12:00 and 24:00 were collected, the invention keeps only those time granularities that fall within the maximum time granularity Tmax covered by the records' start and end times;
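A small sketch of the Tmax selection in step A2 is given below. The function names and the `strict` flag are illustrative assumptions: by default the sketch follows the wording of step A2 (granularity strictly less than the covered duration), and `strict=False` shows the alternative reading in which a granularity equal to the covered duration also qualifies.

```python
# Time granularities in minutes, as listed in the description.
GRANULARITIES_MIN = [1, 5, 15, 30, 60, 180, 360, 720, 1440]

def max_granularity(start_min, end_min, granularities=GRANULARITIES_MIN, strict=True):
    """Return Tmax for records covering [start_min, end_min), in minutes since midnight.

    strict=True keeps granularities strictly shorter than the covered duration
    (the literal wording of step A2); strict=False also accepts equality.
    """
    covered = end_min - start_min
    if strict:
        fitting = [g for g in granularities if g < covered]
    else:
        fitting = [g for g in granularities if g <= covered]
    return max(fitting) if fitting else None

def usable_granularities(tmax, granularities=GRANULARITIES_MIN):
    """Granularities that step A3 would process, i.e. those <= Tmax."""
    return [] if tmax is None else [g for g in granularities if g <= tmax]

# Records available only for 12:00-24:00, i.e. 720 minutes of coverage.
tmax = max_granularity(12 * 60, 24 * 60)
kept = usable_granularities(tmax)
```

With half a day of records the strict reading gives a Tmax of 360 minutes, so the 720- and 1440-minute granularities are skipped for that day.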

Step A3: extract the time granularities one by one and check whether the extracted granularity is less than or equal to Tmax. If so, divide the span between the start and end times of that day's records into multiple consecutive communication periods whose length is the extracted time granularity, and count the user's calls in each communication period; these counts are the call frequencies of the extracted time granularity in each communication period of that day. Then extract the next time granularity, until all time granularities have been processed. If not, skip it and extract the next time granularity, until all time granularities have been processed;

The extracted time granularity must be less than or equal to Tmax. For example, when Tmax = 30 minutes, the extracted time granularities are 1, 5, 15 and 30 minutes, and granularities greater than Tmax are not processed further. When a day's call detail records run from 0:00 to 24:00 and the extracted time granularity T is 30 minutes, the communication periods are 0:00--0:30, 0:31--1:00, ..., 23:01--23:30, 23:31--24:00. Every communication period obtained by dividing by time granularity starts at second 00 of its first minute and ends at second 59 of the minute before the next communication period begins;
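The period division and per-period call counting of step A3 can be sketched as below. Times are expressed as minutes since midnight and periods are treated as half-open intervals, which matches the rule that a period ends at second 59 of the minute before the next one starts; the function name and the sample call times are made up for illustration.

```python
from collections import Counter

def call_counts_per_period(call_start_minutes, day_start, day_end, granularity):
    """Count calls in consecutive periods of `granularity` minutes.

    Period p covers [day_start + p*granularity, day_start + (p+1)*granularity).
    Calls outside the covered span are ignored.
    """
    n_periods = (day_end - day_start) // granularity
    counts = Counter()
    for t in call_start_minutes:
        if day_start <= t < day_start + n_periods * granularity:
            counts[(t - day_start) // granularity] += 1
    return [counts[p] for p in range(n_periods)]

# Hypothetical call start times (minutes since midnight) for one user and one day.
calls = [5, 12, 29, 30, 61, 62, 63, 1439]
counts_30 = call_counts_per_period(calls, 0, 1440, 30)
```

For a full day and a 30-minute granularity this yields 48 periods. The call at minute 30 falls in the second period, not the first.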

Step A4: check whether the call detail records of all days have been processed. If so, continue with the next step; if not, extract the user's call detail records for the next day and return to step A2;

Step A5: for each time granularity, select the maximum among the call frequencies of all communication periods on all days. The communication period with the maximum is the user's high-frequency communication period under that time granularity, i.e. the period in which the user's calls over those days are most frequent and most concentrated;
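Step A5 amounts to an arg-max over all periods of all days for one granularity. The sketch below assumes the per-day counts are already available as lists; the data layout, the function name and the tie-breaking rule (first maximum encountered wins) are assumptions, since the text does not say how ties are resolved.

```python
def high_frequency_period(daily_counts):
    """Find the busiest communication period for one time granularity.

    daily_counts maps a day label to the list of call counts per period.
    Returns (day, period_index, count); ties keep the first maximum seen.
    """
    best = None
    for day, counts in daily_counts.items():
        for idx, count in enumerate(counts):
            if best is None or count > best[2]:
                best = (day, idx, count)
    return best

# Hypothetical counts for three days, three periods per day.
daily = {
    "2016-08-01": [0, 3, 1],
    "2016-08-02": [2, 7, 0],
    "2016-08-03": [7, 1, 1],
}
busiest = high_frequency_period(daily)
```

Here two periods share the maximum count of 7, and the earlier one (second period of the second day) is reported.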

Step A6: calculate the user's communication behavior indicators under each time granularity, i.e. the communication behavior indicators within the user's high-frequency call period under each time granularity. The indicators may include, without limitation: call frequency, number of called numbers, call duration, ring duration, number of active releases, number of passive releases, callee dispersion, the correlation coefficient among the called numbers of the same caller, the maximum frequency of calls to the same ten-thousand-number segment, caller ratio (i.e. outgoing calls / total outgoing and incoming calls) and the standard deviation of call intervals (this indicator requires 3 or more called numbers). The correlation coefficient among the called numbers of the same caller is the ratio of the number of called numbers that have call behavior with one another to the total number of numbers the user has called. For example, user A called 100 numbers within the high-frequency call period of time granularity T; within a specified period (e.g. the training period of days 1 to 5), four of these called numbers, B, C, D and E, had call behavior with one another (e.g. five calls with call duration >= 0: B->C, D->E, C->B, C->D, D->C), so the correlation coefficient among the called numbers of the same caller (user A) is 4/100. The maximum frequency of calls to the same ten-thousand-number segment is the largest number of called numbers that the user called and that belong to the same ten-thousand-number segment, where the ten-thousand-number segment is what remains of a called number after its last 4 digits are removed. For example, if the ten-thousand-number segments of the numbers a user called are 1395193, 1395193, 1390123 and 1390438, the numbers of called numbers per segment are 2, 1 and 1, so the maximum frequency of calls to the same ten-thousand-number segment is 2.
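The two indicators defined in step A6 can be checked against the numbers in the text with a short sketch. The function names are illustrative, the placeholder callee labels (`X0`, `X1`, ...) and the full 11-digit numbers are invented so that the examples reproduce the quoted segments.

```python
from collections import Counter

def callee_correlation(callees, calls_between):
    """Share of a caller's callees that have call behavior with another callee.

    callees: numbers the user called; calls_between: (src, dst) call pairs
    observed in the specified period.
    """
    callee_set = set(callees)
    involved = set()
    for src, dst in calls_between:
        if src in callee_set and dst in callee_set and src != dst:
            involved.update((src, dst))
    return len(involved) / len(callee_set) if callee_set else 0.0

def max_same_segment_count(callees):
    """Largest number of callees sharing one segment (number minus last 4 digits)."""
    segments = Counter(number[:-4] for number in callees)
    return max(segments.values()) if segments else 0

# User A called 100 numbers; B, C, D and E called one another.
callees_of_a = ["B", "C", "D", "E"] + [f"X{i}" for i in range(96)]
calls = [("B", "C"), ("D", "E"), ("C", "B"), ("C", "D"), ("D", "C")]
coefficient = callee_correlation(callees_of_a, calls)

# Made-up numbers whose segments are 1395193, 1395193, 1390123 and 1390438.
dialed = ["13951930001", "13951930002", "13901230003", "13904380004"]
segment_max = max_same_segment_count(dialed)
```

The sketch gives 4/100 for the correlation coefficient and 2 for the segment indicator, matching the worked examples.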

In step A6, when the number of the user's communication behavior indicators under each time granularity is a and the number of selected time granularities is b, the number of features (i.e. communication behavior indicators) of each user in the training sample set of the random forest classification model may be: M = a*b + 2.
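As a quick arithmetic check of M = a*b + 2, the sketch below plugs in the eleven indicators and nine time granularities named in the description. Reading the two extra features as the subscriber number and the user identifier is an assumption based on the earlier paragraph; the text does not say so explicitly.

```python
def feature_count(indicators_per_granularity, n_granularities):
    """M = a*b + 2; the +2 is assumed to be subscriber number and user identifier."""
    return indicators_per_granularity * n_granularities + 2

a = 11  # indicators listed in step A6
b = 9   # granularities: 1, 5, 15, 30, 60, 180, 360, 720, 1440 minutes
M = feature_count(a, b)
```

With these values each training sample has 101 features.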

The random forest classification model can handle high-dimensional attribute data without feature selection, trains quickly, can detect interactions between attributes during training, can be parallelized, and can output attribute importance, class probabilities and predicted classes. It can therefore be chosen for identifying harassing numbers.

In the present invention, the basic idea of the random forest classification model is as follows. First, K groups of samples are drawn from the original training set (N samples with M attribute dimensions) by random sampling with replacement, and each group has the same size N as the original training set. Second, m attribute dimensions are selected at random, where m is less than or equal to the total attribute dimension M. Then a decision tree is generated each time from the N samples and m attributes, so that K decision tree models are built in total and K classification results are obtained. Finally, the K classification results are used to vote on each record and determine its final class. Building the random forest classification model therefore has two main parts. One is the construction of the decision trees that form the forest; in the present invention the decision trees may use the Gini impurity method, in which the smaller the impurity, the more important the attribute. The other is the decision process, which outputs the best classification result by voting. Accordingly, in step one, forming a training sample set from the harassing and non-harassing numbers and their communication behavior indicators and building the random forest classification model may further comprise:

Step 11: using the random forest classification model, randomly select m of the M communication behavior indicators of each training sample in the training sample set for training, producing K feature subsets and thereby K decision trees. Each decision tree carries its own prediction probability for the class to which an input number belongs. The decision trees may use the Gini impurity method, in which the smaller the impurity, the more important the feature.

As shown in Fig. 3, in step 11 the generation process of the k-th decision tree (k = 1, 2, ..., K) may further comprise:

Step 111: using the bootstrap method, draw N samples with replacement from the training sample set S to form the sample set s(k) of the root node of the k-th decision tree, and set the minimum number of samples covered by a branch to d, where N is the number of samples in S and the feature dimension (i.e. the number of communication behavior indicators) of each training sample in S is M;

The sample set s(k) of the root node of every decision tree has the same number of samples as the training sample set S;

Step 112: randomly draw m of the M feature dimensions as the feature set of the k-th decision tree, where the formula for m may be: [formula not reproduced in this text];
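Steps 111 and 112 can be sketched with the standard library alone. The helper names and toy data are hypothetical, and the choice m = floor(sqrt(M)) is purely an assumption made so the example runs: it is a common heuristic for random forests, not the formula of this document.

```python
import math
import random

def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement: the root-node sample set s(k)."""
    return [rng.choice(samples) for _ in samples]

def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of n_features for one tree."""
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)                      # fixed seed for repeatability
S = [(i, i % 2) for i in range(10)]         # toy (sample_id, label) pairs, N = 10
s_k = bootstrap_sample(S, rng)              # same size as S, duplicates allowed

M = 9                                       # total feature dimension
m = max(1, math.isqrt(M))                   # ASSUMPTION: m = floor(sqrt(M))
features_k = random_feature_subset(M, m, rng)
```

Repeating both draws K times with independent randomness gives the K sample sets and K feature subsets from which the trees are grown.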

Step 113: starting from the root node of the k-th decision tree, calculate the Gini impurity of each of the m features according to the minimum-Gini-impurity principle: $I_G(j_k) = \frac{N(Z_1)}{N(Z_1)+N(Z_2)}\,\mathrm{Gini}(Z_1) + \frac{N(Z_2)}{N(Z_1)+N(Z_2)}\,\mathrm{Gini}(Z_2)$, where $I_G(j_k)$ is the Gini impurity of the j-th feature of the k-th decision tree, $Z_1$ is the sample set of the left node produced by the optimal binary split of the root-node sample set s(k) on feature j, $Z_2$ is the sample set of the right node produced by that split, $N(Z_1)$ and $N(Z_2)$ are the sample sizes of $Z_1$ and $Z_2$, and $\mathrm{Gini}(Z_1)$ and $\mathrm{Gini}(Z_2)$ are their Gini impurities. $\mathrm{Gini}(Z_l)$ (l = 1 or 2) may further be calculated as $\mathrm{Gini}(Z_l) = 1 - \sum_{i=0}^{1} p(i \mid Z_l)^2$, where i = 0 denotes a non-harassing number, i = 1 denotes a harassing number, and $p(i \mid Z_l)$ is the probability of a harassing or non-harassing number in the branch corresponding to $Z_l$ under the j-th feature in the k-th decision tree;

Step 114: select the minimum among the Gini impurities of the m features and take the corresponding feature as the root node, then split the root node into a left node and a right node. With the root node as the constraint, continue to calculate the Gini impurities of the m features at each resulting node and select the feature with the minimum value as that node, and so on as the tree grows, thereby forming the decision tree. If the number of samples covered by a branch is less than d, the current node of that branch is set as a leaf node, i.e. it stops growing, and training continues on the other nodes until every node has been trained or marked as a leaf node.
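The feature selection at one node (steps 113 and 114) can be illustrated as follows. This is a sketch under stated assumptions: it uses the standard Gini impurity (one minus the sum of squared class shares) weighted by branch size, tries every observed value as a threshold, and works on invented toy data; the function names are not from the source.

```python
def gini(labels):
    """Gini impurity of 0/1 labels: 1 - sum over classes of p_i squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def best_split_impurity(values, labels):
    """Lowest size-weighted Gini over all binary thresholds on one feature.

    Returns (impurity, threshold), or None if the feature has a single value.
    """
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        n = len(left) + len(right)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best

def choose_feature(rows, labels, feature_ids):
    """Pick the feature whose best split has the smallest Gini impurity."""
    scored = {j: best_split_impurity([r[j] for r in rows], labels) for j in feature_ids}
    scored = {j: s for j, s in scored.items() if s is not None}
    return min(scored, key=lambda j: scored[j][0])

# Toy data: feature 0 resembles a call frequency, feature 1 is uninformative.
rows = [(50, 3), (60, 1), (55, 2), (2, 3), (3, 1), (1, 2)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = harassing, 0 = non-harassing
root_feature = choose_feature(rows, labels, [0, 1])
```

Feature 0 separates the two classes perfectly at threshold 3, so its impurity is 0 and it is chosen for the node. A full tree would repeat this on each child until a branch covers fewer than d samples.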

As shown in Fig. 4, step two of Fig. 1 may further comprise:

Step 21: input the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, and calculate, for each leaf node of each decision tree, the prediction probability of the class to which the number to be identified belongs: $P_k^r(i) = \frac{N_k^r(i)}{N_k^r}$, where i = 0 or 1, i = 0 denoting a non-harassing number and i = 1 a harassing number; $P_k^r(i)$ is the prediction probability with which the r-th leaf node of the k-th decision tree judges the number to be identified to belong to class i, $N_k^r(i)$ is the number of numbers in the r-th leaf node of the k-th decision tree that belong to class i, and $N_k^r$ is the total number of numbers contained in the r-th leaf node of the k-th decision tree;
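The leaf-level probability of step 21 is the share of each class among the numbers held by a leaf. A minimal sketch, with a hypothetical function name and an invented leaf, using exact fractions so the result is easy to compare by hand:

```python
from fractions import Fraction

def leaf_class_probabilities(leaf_labels):
    """Class shares in one leaf; labels are 0 (non-harassing) or 1 (harassing)."""
    total = len(leaf_labels)
    if total == 0:
        raise ValueError("a leaf node must contain at least one training number")
    harassing = sum(leaf_labels)
    return {0: Fraction(total - harassing, total), 1: Fraction(harassing, total)}

# Hypothetical leaf holding six training numbers, five of them harassing.
leaf = [1, 1, 1, 1, 1, 0]
probs = leaf_class_probabilities(leaf)
```

For this leaf the harassing share is 5/6 and the non-harassing share is 1/6, and the two always sum to 1.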

Step 22: calculate the prediction probability with which each decision tree judges the class of the number to be identified: [formula not reproduced in this text], where $P_k(i)$ is the prediction probability with which the k-th decision tree judges the number to be identified to belong to class i, and $R_k$ is the total number of leaf nodes of the k-th decision tree;

Step 23: calculate the sum of the prediction probabilities with which all decision trees judge the class of the number to be identified: $w(i) = \sum_{k=1}^{K} P_k(i)$, where w(i) is the sum of the prediction probabilities with which all decision trees judge the number to be identified to belong to class i. Then select the maximum; the class of the number to be identified is the class corresponding to the maximum, i.e. when w(0) > w(1) the number belongs to class 0 (non-harassing number), and when w(0) < w(1) it belongs to class 1 (harassing number).

The random forest classification model randomly generates K decision trees, each with multiple leaf nodes. For each input subscriber number, some leaf nodes judge it to be a harassing number and others a non-harassing number. From the prediction probabilities of all leaf nodes, the probability with which each decision tree judges the subscriber number to be harassing or non-harassing can be obtained, and for each decision tree these two probabilities sum to 1. For example, the 1st decision tree judges the subscriber number to be a harassing number with probability 5/6 and a non-harassing number with probability 1/6; the 2nd decision tree judges it harassing with probability 2/7 and non-harassing with probability 5/7; ...; the K-th decision tree judges it harassing with probability 3/5 and non-harassing with probability 2/5. The K decision trees' probabilities of harassing are then summed (5/6 + 2/7 + ... + 3/5), as are their probabilities of non-harassing (1/6 + 5/7 + ... + 2/5). If the summed probability of harassing is greater than the summed probability of non-harassing, the input number is a harassing number; otherwise it is a non-harassing number.
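The summation and comparison described above can be reproduced with the three per-tree probabilities quoted in the example, treating them as a forest of K = 3 trees (the text elides the trees in between). The function name is illustrative, and returning class 0 on an exact tie follows the "otherwise" wording of the paragraph.

```python
from fractions import Fraction

def classify(tree_probs):
    """Sum per-tree probabilities and pick the class with the larger total.

    tree_probs: list of (P_k(0), P_k(1)) pairs, one pair per decision tree.
    Returns (label, w0, w1); a tie yields label 0 (non-harassing).
    """
    w0 = sum(p0 for p0, _ in tree_probs)
    w1 = sum(p1 for _, p1 in tree_probs)
    return (1 if w1 > w0 else 0), w0, w1

# (non-harassing, harassing) probabilities of the three trees in the example.
trees = [
    (Fraction(1, 6), Fraction(5, 6)),
    (Fraction(5, 7), Fraction(2, 7)),
    (Fraction(2, 5), Fraction(3, 5)),
]
label, w0, w1 = classify(trees)
```

The harassing total is 361/210 and the non-harassing total is 269/210, so the number is classified as harassing (label 1). The two totals add up to 3, one per tree.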

When building a random forest classification model, the settings of the number of decision trees, the feature dimension (i.e. the number of communication behavior indicators), the depth of the decision trees and similar values all affect the model's recognition performance. To improve recognition further, the invention may also build multiple random forest classification models (differing in the number of decision trees, the feature dimension and the tree depth), build a test sample set, test each of the models on it, and select an optimal random forest classification model according to the test results. As shown in Fig. 5, the invention may further comprise:

Step B1: extract the communication behavior indicators of each test sample from the test sample set one by one, and input all extracted communication behavior indicators into each random forest classification model, obtaining each model's judgment of whether the test sample is a harassing number;

Step B2: match the harassing numbers identified by each random forest classification model against confirmed harassing numbers (e.g. harassing numbers marked by Internet companies or by the operator's reporting and complaint system), and calculate the precision and recall of each model;

Step B3: from the precision and recall, calculate the recognition rate of each random forest classification model: $F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$, where Precision is the precision and Recall is the recall. Then select the maximum F among the recognition rates of all random forest classification models; the model with the maximum is the optimal random forest classification model.
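Steps B2 and B3 can be sketched as below, assuming the recognition rate F is the harmonic mean of precision and recall (the usual F-measure); that reading, the function names, the model names and the toy number sets are all assumptions for illustration.

```python
def precision_recall_f(predicted, confirmed):
    """Precision, recall and F for one model.

    predicted: numbers the model flags as harassing;
    confirmed: numbers known to be harassing.
    """
    predicted, confirmed = set(predicted), set(confirmed)
    true_positives = len(predicted & confirmed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # ASSUMPTION: F is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)

def pick_best_model(predictions_by_model, confirmed):
    """Name of the model with the highest F on the test set."""
    return max(
        predictions_by_model,
        key=lambda name: precision_recall_f(predictions_by_model[name], confirmed)[2],
    )

confirmed = {"a", "b", "c", "d"}
predictions = {
    "model_1": {"a", "b", "x"},        # 2 of 3 flagged numbers are correct
    "model_2": {"a", "b", "c", "y"},   # 3 of 4 flagged numbers are correct
}
best = pick_best_model(predictions, confirmed)
f1_model_2 = precision_recall_f(predictions["model_2"], confirmed)[2]
```

In this toy comparison the second model wins because it has both higher precision and higher recall.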

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for identifying harassing numbers, characterized by comprising:

Step one: select a number of confirmed harassing and non-harassing numbers, calculate the communication behavior indicators of these numbers over a period of time, then form a training sample set from the numbers and their communication behavior indicators and build a random forest classification model, the input of the model being the communication behavior indicators of each subscriber number and the output being the prediction probability with which all decision trees judge the number to be a harassing or a non-harassing number;

Step two: input the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, calculate the prediction probability with which all decision trees in the model judge it to be a harassing or a non-harassing number, and decide on that basis whether the number to be identified is a harassing number.

2. The method according to claim 1, characterized in that calculating the communication behavior indicators of a subscriber number over a period of time further comprises:

Step A: collect the user's historical call detail records for several consecutive days, select multiple time granularities, find the user's high-frequency call period under each time granularity, and finally calculate the user's communication behavior indicators under each time granularity from the communication behavior indicators within the high-frequency call period under that granularity, wherein the communication behavior indicators of a subscriber number over a period of time include, without limitation: the user's communication behavior indicators under each time granularity, the subscriber number and a user identifier.

3. The method according to claim 2, characterized in that the time granularity takes, without limitation, the following values: 1, 5, 15, 30, 60, 180, 360, 720 and 1440 minutes.

4. The method according to claim 2, characterized in that step A further comprises:

Step A1: extract the user's call detail records day by day;

Step A2: read the start and end times of that day's call detail records and determine the maximum time granularity Tmax covered by them, i.e. select, from the chosen time granularities, the largest one whose value is less than the duration between the start and end times;

Step A3: extract the time granularities one by one and check whether the extracted granularity is less than or equal to Tmax; if so, divide the span between the start and end times of that day's records into multiple consecutive communication periods whose length is the extracted time granularity, and count the user's calls in each communication period, these counts being the call frequencies of the extracted time granularity in each communication period of that day, then extract the next time granularity, until all time granularities have been processed; if not, extract the next time granularity, until all time granularities have been processed;

Step A4: check whether the call detail records of all days have been processed; if so, continue with the next step; if not, extract the user's call detail records for the next day and return to step A2;

Step A5: for each time granularity, select the maximum among the call frequencies of all communication periods on all days, the communication period with the maximum being the user's high-frequency communication period under that time granularity;

Step A6: calculate the user's communication behavior indicators under each time granularity, i.e. the communication behavior indicators within the user's high-frequency call period under each time granularity.

Method the most according to claim 1, it is characterised in that communication behavior index includes but not limited to: the calling frequency, quilt Cry number, the duration of call, ring duration, actively discharge number of times, passively release number of times, called dispersion, same caller called Correlation coefficient between number, call identical No. ten thousand section maximum frequencys, caller accounting, call time separation standard poor, wherein:

The correlation coefficient among the called numbers of the same caller is the ratio of the number of called numbers, among all numbers the user has called, that have call activity with one another to the total number of called numbers the user has called. The maximum frequency of calls to the same ten-thousand number segment is the maximum number of called numbers that the user has called and that belong to the same ten-thousand number segment, where the ten-thousand number segment is the part of a called number that remains after its last 4 digits are removed.
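The two indexes defined above can be illustrated with a short sketch. It assumes that numbers are digit strings, that called numbers are counted once each, and that call activity between called numbers is available as (caller, callee) pairs; the function names and the `call_pairs` argument are hypothetical.

```python
from collections import Counter

def max_same_segment_count(called_numbers):
    """Largest count of distinct called numbers sharing one ten-thousand segment.

    The segment is the number with its last 4 digits removed.
    """
    segments = Counter(num[:-4] for num in set(called_numbers))
    return max(segments.values()) if segments else 0

def callee_correlation(called_numbers, call_pairs):
    """Share of called numbers that have call activity with another called number.

    call_pairs: iterable of (caller, callee) records (assumed input format).
    """
    callees = set(called_numbers)
    if not callees:
        return 0.0
    linked = set()
    for a, b in call_pairs:
        # Only calls between two different numbers of the callee set count.
        if a in callees and b in callees and a != b:
            linked.update((a, b))
    return len(linked) / len(callees)
```

A caller that dials sequentially through one number block would tend to show a high segment count and a low correlation among its called numbers.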

The method according to claim 1, characterized in that, in step one, forming a training sample set from the harassing and non-harassing numbers and their communication behavior indexes and building the random forest classification model further comprises:

Step 11, using the random forest classification model, randomly select m communication behavior indexes from the M communication behavior indexes of each training sample in the training sample set for training, so as to produce K feature subsets and thereby generate K decision trees, each decision tree including its own prediction probability for the class of the input number, wherein the decision trees are built using the Gini impurity method.

The method according to claim 6, characterized in that, in said step 11, the generation process of the k-th decision tree further comprises:

Step 111, using the bootstrap method, draw N samples with replacement from the training sample set S to form the sample set s(k) of the root node of the k-th decision tree, and set the minimum number of samples covered by a branch to d, where N is the number of samples in the training sample set S and the feature dimension of each training sample in S is M;

Step 112, randomly draw m of the M feature dimensions as the index feature set of the k-th decision tree;

Step 113, starting from the root node of the k-th decision tree and following the minimum Gini impurity principle, compute the Gini impurity of each of the m features: $I_G(j_k) = \frac{N(Z_1)}{N(Z_1)+N(Z_2)}\,Gini(Z_1) + \frac{N(Z_2)}{N(Z_1)+N(Z_2)}\,Gini(Z_2)$, where $I_G(j_k)$ is the Gini impurity of the j-th feature of the k-th decision tree, $Z_1$ is the sample set of the left node produced by the optimal binary split of the root node's sample set s(k) on feature j, $Z_2$ is the sample set of the right node produced by that same split, $N(Z_1)$ and $N(Z_2)$ are the sample counts of $Z_1$ and $Z_2$ respectively, and $Gini(Z_1)$ and $Gini(Z_2)$ are the Gini impurities of $Z_1$ and $Z_2$ respectively;

Step 114, choose the minimum among the Gini impurities of the m features and use the feature corresponding to that minimum as the root node, then split the root node into a left node and a right node. Next, taking the root node as a constraint, continue computing the Gini impurities of the m features under that root node, select the feature corresponding to the minimum as the node from which growth continues, and so on, thereby forming the decision tree. If the number of samples covered by a branch is less than d, the current node of that branch is set as a leaf node, i.e. that node stops growing, and training continues on the other nodes until every node has been trained or marked as a leaf node.
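The root-node part of steps 111 to 114 can be sketched in a few lines. The sketch assumes 0/1 labels (0 for non-harassing, 1 for harassing), takes the impurity of a split to be the size-weighted average of the two child impurities, and tries every observed feature value as a threshold; the function names, the `rng` argument and the threshold search are illustrative choices, not details stated in the method.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def split_impurity(rows, labels, j, threshold):
    """Size-weighted Gini impurity of splitting on feature j at threshold."""
    left = [y for x, y in zip(rows, labels) if x[j] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[j] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_root_split(rows, labels, m, rng):
    """Return (impurity, feature index, threshold) of the best root split."""
    n = len(rows)
    # Step 111: bootstrap, i.e. draw N samples with replacement.
    idx = [rng.randrange(n) for _ in range(n)]
    s_rows = [rows[i] for i in idx]
    s_labels = [labels[i] for i in idx]
    # Step 112: draw m of the M feature dimensions without repetition.
    features = rng.sample(range(len(rows[0])), m)
    # Steps 113-114: evaluate every candidate split and keep the minimum.
    candidates = [
        (split_impurity(s_rows, s_labels, j, x[j]), j, x[j])
        for j in features
        for x in s_rows
    ]
    return min(candidates)
```

Growing the full tree would repeat the split search on each child node and stop a branch once it covers fewer than d samples.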

The method according to claim 7, characterized in that the formula for m is: [formula not reproduced], and the formula for $Gini(Z_l)$ is: $Gini(Z_l) = 1 - \sum_{i=0}^{1} p(i \mid Z_l)^2$, where l = 1 or 2 and i = 0 or 1, i = 0 denotes a non-harassing number, i = 1 denotes a harassing number, and $p(i \mid Z_l)$ is the probability that a number in the branch corresponding to $Z_l$, under the j-th feature condition in the k-th decision tree, is identified as a harassing number or a non-harassing number.

The method according to claim 1, characterized in that step two further comprises:

Step 21, input the communication behavior indexes of the number to be identified over a period of time into the random forest classification model, and compute, for each leaf node of each decision tree, the prediction probability of the class to which the number to be identified belongs: $p_k^r(i) = \frac{n_k^r(i)}{n_k^r}$, where i = 0 or 1, i = 0 denotes a non-harassing number, i = 1 denotes a harassing number, $p_k^r(i)$ is the prediction probability with which the r-th leaf node of the k-th decision tree judges that the number to be identified belongs to the i-th class, $n_k^r(i)$ is the number of numbers in the r-th leaf node of the k-th decision tree that belong to the i-th class, and $n_k^r$ is the total number of numbers contained in the r-th leaf node of the k-th decision tree;

Step 22, compute each decision tree's prediction probability for the class of the number to be identified: [formula not reproduced], where $P_k(i)$ is the prediction probability with which the k-th decision tree judges that the number to be identified belongs to the i-th class, and $R_k$ is the total number of leaf nodes of the k-th decision tree;

Step 23, compute the sum of the prediction probabilities with which all decision trees judge the class of the number to be identified: $w(i) = \sum_{k=1}^{K} P_k(i)$, where w(i) is the sum of the prediction probabilities with which all decision trees judge that the number to be identified belongs to the i-th class; then select the maximum of these sums, and the class of the number to be identified is the class corresponding to that maximum.
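Steps 21 to 23 amount to soft voting over the trees, which the sketch below illustrates. It assumes that each tree contributes the class shares of the single leaf that the number to be identified reaches, since the per-tree formula of step 22 is not available here; the function names and the input layout are hypothetical.

```python
def leaf_probabilities(class_counts):
    """Class shares in one leaf.

    class_counts: (non-harassing count, harassing count) of training numbers.
    """
    total = sum(class_counts)
    return tuple(c / total for c in class_counts)

def classify(leaf_counts_per_tree):
    """Sum per-tree probabilities and pick the class with the larger sum.

    leaf_counts_per_tree: for each tree, the class counts of the leaf that
    the number to be identified falls into (assumed input layout).
    Returns (label, w) with label 0 = non-harassing and 1 = harassing.
    """
    w = [0.0, 0.0]
    for counts in leaf_counts_per_tree:
        p = leaf_probabilities(counts)  # step 21: leaf class shares
        for i in (0, 1):                # steps 22-23: accumulate over trees
            w[i] += p[i]
    # On a tie, max() returns the first candidate, i.e. class 0.
    label = max((0, 1), key=lambda i: w[i])
    return label, w
```

For example, three trees whose reached leaves hold (1, 3), (2, 2) and (0, 4) training numbers give w(0) = 0.75 and w(1) = 2.25, so the number is labeled as harassing.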

The method according to claim 1, characterized in that it further comprises:

Step B1, extract the communication behavior indexes of each test sample from the test sample set one by one, and input all extracted communication behavior indexes into each random forest classification model, thereby obtaining each random forest classification model's determination of whether the test sample is a harassing number;

Step B2, match the harassing numbers identified by each random forest classification model against the confirmed harassing numbers, and compute the precision and recall of each random forest classification model;

Step B3, compute the recognition rate F of each random forest classification model from its precision and recall: [formula not reproduced], where Precision is the precision and Recall is the recall, and select the maximum F among the recognition rates of all random forest classification models; the random forest classification model corresponding to that maximum is the optimal random forest classification model.
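Steps B2 and B3 can be sketched as below. Because the formula for F is not available here, the sketch assumes the common F1 score, the harmonic mean of precision and recall; that choice, the function names and the model labels are assumptions rather than details confirmed by the method.

```python
def precision_recall(predicted, confirmed):
    """Precision and recall of one model.

    predicted: set of numbers the model flags as harassing.
    confirmed: set of numbers known to be harassing.
    """
    hits = len(predicted & confirmed)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(confirmed) if confirmed else 0.0
    return precision, recall

def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1, an assumed form of F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pick_best_model(predictions_by_model, confirmed):
    """Return (name of the model with the largest F, all scores)."""
    scores = {
        name: f_score(*precision_recall(pred, confirmed))
        for name, pred in predictions_by_model.items()
    }
    return max(scores, key=scores.get), scores
```

With confirmed numbers {a, b, c, d}, a model that flags {a, b, x, y} scores F = 0.5, while one that flags {a, b, c} scores about 0.857 and would be selected.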