CN110147430A

CN110147430A - Harassing call recognition method and system based on the random forest algorithm

Info

Publication number: CN110147430A
Application number: CN201910339683.6A
Authority: CN
Inventors: 周红敏; 祝敬安; 王红熳; 韦红; 丁正; 顾晓东; 张飞; 贾岩峰; 刘艳
Original assignee: SHANGHAI XINFANG SOFTWARE Co Ltd; BEIJING XINFANG INTELLIGENT SYSTEM CO LTD
Current assignee: SHANGHAI XINFANG SOFTWARE Co Ltd; BEIJING XINFANG INTELLIGENT SYSTEM CO LTD
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2019-08-20

Abstract

The invention discloses a harassing call recognition method and system based on the random forest algorithm. A new sample set is generated by over-sampling the raw data set with the SMOTE algorithm; a harassing call identification model is constructed from the new sample set, and the random forest parameters in the model are initialized; after training, the model is verified using ten-fold cross validation and its recognition-effect evaluation values are calculated; an optimal harassing call identification model is obtained, pickle is used to serialize and then deserialize the optimal model, an API is constructed, and the optimal model is deployed online as an interface; when a call record to be tested arrives, the API interface is called and the data are input into the optimal harassing call identification model for prediction. The beneficial effects of the invention are that manual misjudgments and missed judgments are effectively avoided, the error of the model is reduced, and the availability and practicality of the model are improved.

Description

Harassing call recognition method and system based on the random forest algorithm

Technical field

The present invention relates to the field of natural language processing, and in particular to a harassing call recognition method and system based on the random forest algorithm.

Background technique

A harassing call is a call made to promote products, or a call in which the caller pretends to be the police or a bank clerk in order to deliberately harass the recipient. Harassing calls are highly disruptive, enticing and deceptive, and are easy to disguise; the numbers dial frequently and the cases are hard to solve, which seriously endangers people's normal life and personal privacy.

For the problems in the relevant technologies, currently no effective solution has been proposed.

Summary of the invention

In view of the above technical problems in the related art, the present invention proposes a harassing call recognition method and system based on the random forest algorithm, which can identify harassing calls quickly and effectively and resolves the problem of harassing calls being misjudged or missed by manual review.

To realize the above-mentioned technical purpose, the technical scheme of the present invention is realized as follows:

A harassing call recognition method based on the random forest algorithm, comprising the following steps:

processing the raw data set and determining the distribution ratio of positive and negative harassing call samples;

for imbalanced harassing call samples, over-sampling the raw data set using the SMOTE algorithm to generate a new sample set and balance the data distribution;

constructing a harassing call identification model from the new sample set, initializing the random forest parameters in the model, and setting the input and output variables of the random forest;

training the harassing call identification model, verifying it using ten-fold cross validation, and calculating its recognition-effect evaluation values;

obtaining the optimal harassing call identification model using grid search, which improves the precision of the random forest and the training efficiency; using pickle to serialize and then deserialize the optimal model, constructing an API, and deploying the optimal model online as an interface;

when a call record to be tested arrives, calling the API interface and inputting the data into the optimal harassing call identification model for prediction.

Further, generating the new sample set by over-sampling the raw data set using the SMOTE algorithm comprises:

analyzing the minority-class harassing call samples, and adding new samples synthesized from those samples to the raw data set;

for each minority-class harassing call sample, randomly selecting several first samples from its nearest neighbors;

randomly selecting a second sample on the line connecting the harassing call sample and the first sample.

Further, constructing the harassing call identification model from the new sample set and initializing the random forest parameters in the model comprises:

setting the random forest parameters, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree;

calculating the information gain of each attribute and choosing the most suitable node; repeating the information gain calculation for the child nodes and choosing the node with the maximum information gain, and so on, to generate multiple trees, wherein the information gain is calculated as follows:

G(D, A) = H(D) - H(D | A)

wherein H(D) is the empirical entropy and H(D | A) is the empirical conditional entropy given the selected feature A;

constructing the random forest according to the random forest parameters and the information gain values, and training multiple decision trees with the random forest algorithm to generate the harassing call identification model.

Further, the method also comprises evaluating the harassing call identification model by means of the recognition-effect evaluation values, wherein the evaluation values include precision, recall and F1-score, calculated respectively as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 * Precision * Recall / (Precision + Recall)

wherein TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative but predicted positive, and FN is the number of samples that are positive but predicted negative.

As shown in Fig. 2, another aspect of the present invention provides a harassing call identification system based on the random forest algorithm, comprising:

a determining module, for processing the raw data set and determining the distribution ratio of positive and negative harassing call samples;

a generation module, for generating a new sample set by over-sampling the raw data set using the SMOTE algorithm;

a first building module, for constructing a harassing call identification model from the new sample set and initializing the random forest parameters in the model;

a verification module, for verifying the trained harassing call identification model using ten-fold cross validation and calculating its recognition-effect evaluation values;

an obtaining module, for obtaining the optimal harassing call identification model, using pickle to serialize and then deserialize the optimal model, constructing an API, and deploying the optimal model online as an interface;

an identification module, for calling the API interface when a call record to be tested arrives and inputting the data into the optimal harassing call identification model for prediction.

Further, the generation module comprises:

an analysis module, for analyzing the minority-class harassing call samples and adding new samples synthesized from those samples to the raw data set;

a first selection module, for randomly selecting, for each minority-class harassing call sample, several first samples from its nearest neighbors;

a second selection module, for randomly selecting a second sample on the line connecting the harassing call sample and the first sample.

Further, the first building module comprises:

a parameter setting module, for setting the random forest parameters, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree;

a first computing module, for calculating the information gain of each attribute and choosing the most suitable node, repeating the information gain calculation for the child nodes and choosing the node with the maximum information gain, and so on, to generate multiple trees, wherein the information gain is calculated as follows:

G(D, A) = H(D) - H(D | A)

a second building module, for constructing the random forest according to the random forest parameters and the information gain values, and training multiple decision trees with the random forest algorithm to generate the harassing call identification model.

Further, the system also comprises an evaluation module, for evaluating the harassing call identification model by means of the recognition-effect evaluation values, wherein the evaluation values include precision, recall and F1-score, calculated respectively as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 * Precision * Recall / (Precision + Recall)

Beneficial effects of the present invention: manual misjudgments and missed judgments are effectively avoided, the error of the model is reduced, and the availability and practicality of the model are improved.

Brief description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the harassing call recognition method based on the random forest algorithm according to an embodiment of the present invention;

Fig. 2 is a structure diagram of the harassing call identification system based on the random forest algorithm according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a single-tree structure for user loan repayment capacity prediction according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a single-tree structure for blind date prediction according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a single-tree structure for sports prediction according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a single tree of the harassing call random forest for a city in Jiangsu Province according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of a single tree of the harassing call random forest for a city in Hebei Province according to an embodiment of the present invention.

Specific embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.

As shown in Fig. 1, a harassing call recognition method based on the random forest algorithm according to an embodiment of the present invention comprises the following steps:

Data set D is one month of (encrypted) call records. The raw data have 49 dimensions. After data cleaning, variable derivation and feature screening, the input variables x1, x2, x3, ..., x14 of the random forest are obtained, described in detail as follows:

x1: at the time point with the maximum number of calls, the total number of call records dialed by the calling number;

x2: at the time point with the maximum number of calls, the total number of different called numbers dialed by the calling number;

x3: at the time point with the maximum number of calls, the total call duration of the calling number;

x4: at the time point with the maximum number of calls, the maximum call duration of the calling number;

x5: at the time point with the maximum number of calls, the mean call duration of the calling number;

x6: at the time point with the maximum number of calls, the standard deviation of the call duration of the calling number;

x7: at the time point with the maximum number of calls, the number of call records of the calling number with a call duration of 0;

x8: at the time point with the maximum number of calls, the number of call records of the calling number with a non-zero call duration;

x9: at the time point with the maximum number of calls, the number of numbers in the calling number's call records with a non-zero call duration;

x10: at the time point with the maximum number of calls, the total number of local calls made by the calling number;

x11: at the time point with the maximum number of calls, the number of non-local calls made by the calling number;

x12: at the time point with the maximum number of calls, the maximum ring duration of the calling number;

x13: at the time point with the maximum number of calls, the mean ring duration of the calling number;

x14: at the time point with the maximum number of calls, the standard deviation of the ring duration of the calling number.

A new sample set is generated by over-sampling the raw data set using the SMOTE algorithm;

A harassing call identification model is constructed from the new sample set, and the random forest parameters in the model are initialized;

Different values are set for each random forest parameter, and 10-fold cross validation and grid search are used to repeatedly fit the data and train the harassing call identification model, outputting the evaluation parameters of the model for each training run.
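The combination of grid search and 10-fold cross validation described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn is the library in use (the parameter names listed later in this description match its `RandomForestClassifier`, but the source does not name the library). The synthetic data set and the candidate values in `param_grid` are illustrative stand-ins, not the patent's data or search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 14 screened features x1..x14 (not real call data).
X, y = make_classification(
    n_samples=500, n_features=14, n_informative=6, random_state=0
)

# Illustrative candidate values; the source does not list the grid it searched.
param_grid = {"n_estimators": [20, 60], "max_depth": [4, 8]}

# cv=10 runs ten-fold cross validation for every parameter combination;
# scoring="f1" ranks the combinations by mean F1 score across the folds.
search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=10,
    scoring="f1",
)
search.fit(X, y)

best_model = search.best_estimator_  # refitted on all of X, y with the best parameters
print(search.best_params_, search.best_score_)
```

Ranking by F1 is one possible choice here; precision or recall could be used instead by changing `scoring`.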

The optimal harassing call identification model is obtained; pickle is used to serialize and then deserialize the optimal model, an API is constructed, and the optimal model is deployed online as an interface;

Specifically, the original (encrypted) call detail record (CDR) data set is processed and the distribution ratio of positive and negative samples is determined. The raw data set is over-sampled using the SMOTE algorithm to form the final new sample set, which is divided into a training set and a test set. The random forest parameters in the harassing call identification model are initialized. The model is trained using ten-fold cross validation and verified on the test set, and its precision, recall and F1 score are calculated. The optimal harassing call identification model is obtained by grid search, serialized using pickle, and saved to the server. The model is then deserialized using pickle and an API is built with the Flask framework; the model that meets the business requirements is deployed online as an API interface, enabling short-term, real-time prediction and interception of fraud. When a call record to be tested arrives, the API interface is called and the data are input into the harassing call identification model to predict whether the call is a harassing call. After prediction, the result is returned: the API interface returns 1 for a harassing call and 0 for a normal call.
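The serialize, deserialize and predict flow described above can be sketched with the Python standard library alone. The "model" below is a hypothetical one-rule stand-in (a dictionary with an invented threshold), used only so the example runs without a trained forest; `predict` and `handle_request` are illustrative names. In the deployment described here, the request handler would sit behind a Flask route.

```python
import json
import pickle

# Hypothetical stand-in for the trained model: one threshold on feature x1.
# A real deployment would pickle the fitted random forest instead.
model = {"feature_index": 0, "threshold": -0.019}

# Serialization: in the described flow these bytes are saved to the server.
blob = pickle.dumps(model)

# Deserialization: done once when the API process starts.
restored = pickle.loads(blob)


def predict(loaded_model, features):
    """Return 1 (harassing call) or 0 (normal call) for one 14-value record."""
    value = features[loaded_model["feature_index"]]
    return 1 if value >= loaded_model["threshold"] else 0


def handle_request(body):
    """Take a JSON request body and return the prediction as a JSON string."""
    payload = json.loads(body)
    return json.dumps({"result": predict(restored, payload["features"])})


request_body = json.dumps({"features": [0.5] + [0.0] * 13})
response = handle_request(request_body)
print(response)
```

Only trusted files should be loaded with `pickle.loads`, since unpickling can execute arbitrary code.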

In a specific embodiment of the present invention, generating the new sample set by over-sampling the raw data set using the SMOTE algorithm comprises:

That is, the minority-class harassing call samples A are analyzed, and new samples synthesized from them are added to the raw data set. For each minority-class harassing call sample a, a first sample b is randomly selected from its nearest neighbors, and then a point is randomly selected on the line connecting a and b as the newly synthesized minority-class sample (the second sample). The specific algorithm steps are as follows:

(1) For each sample x in the minority class, its distance to every sample in the minority-class sample set is calculated using the Euclidean distance as the criterion, yielding its k nearest neighbors, where the Euclidean distance d is calculated as shown in (1):

$$d = \sqrt{\sum_{i=1}^{N} \left(x_{1i} - x_{2i}\right)^2} \quad (1)$$

where N is the dimension of the sample data, $x_{1i}$ is the i-th dimension of the first sample, and $x_{2i}$ is the i-th dimension of the second sample.

(2) For each minority-class sample x, several samples are randomly selected from its k nearest neighbors; let a selected neighbor be xn;

(3) For each randomly selected neighbor xn, random linear interpolation with the original sample x is performed to construct a new sample;

(4) The new samples are added to the original data to produce a new training set; after SMOTE sampling, the final new sample set is formed, which includes training samples and test samples.
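Steps (1) to (3) above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the invention: the function names, the `n_per_sample` and `seed` parameters, and the toy data are assumptions. The interpolation follows the usual SMOTE form x_new = x + gap * (xn - x), with gap drawn uniformly from [0, 1).

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two samples of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def smote(minority, n_per_sample=1, k=5, seed=0):
    """Synthesize n_per_sample new points for each minority-class sample.

    Requires at least two minority samples, otherwise there is no neighbor.
    """
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # Step (1): the k nearest minority-class neighbors of x, excluding x.
        others = [s for s in minority if s is not x]
        neighbours = sorted(others, key=lambda s: euclidean(x, s))[:k]
        for _ in range(n_per_sample):
            # Step (2): pick one neighbor xn at random.
            xn = rng.choice(neighbours)
            # Step (3): random point on the segment between x and xn.
            gap = rng.random()
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, xn)])
    return synthetic


# Toy two-dimensional minority class (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_per_sample=2, k=2)
print(len(new_samples))
```

Step (4) then amounts to appending `new_samples` to the original data before splitting it into training and test sets.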

In a specific embodiment of the present invention, constructing the harassing call identification model from the new sample set and initializing the random forest parameters in the model comprises:

The random forest parameters are set, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree. Information gain is the index used for feature selection in the decision tree model: the larger the information gain of a feature, the better that feature is as a choice for splitting.

The random forest parameters and their meanings are as follows:

n_estimators=60: the number of decision trees;

bootstrap=True: sampling with replacement;

criterion=entropy: calculate the information gain of the attributes in order to select the most suitable node;

max_features=sqrt: the number of features considered when selecting the most suitable attribute does not exceed this value;

max_depth=8: the maximum depth of a tree;

min_samples_split=20: the minimum number of samples required to split a node on an attribute;

min_samples_leaf=100: the minimum number of samples in a leaf node;
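The parameter values listed above map directly onto a classifier constructor. The sketch below assumes scikit-learn's `RandomForestClassifier`, whose parameter names match the list; the source does not state which library was used. The data set is synthetic, and `random_state` is added only to make the example reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 14 input variables x1..x14 (not real call data).
X, y = make_classification(
    n_samples=1000, n_features=14, n_informative=6, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=60,        # number of decision trees
    bootstrap=True,         # sample training rows with replacement
    criterion="entropy",    # choose splits by information gain
    max_features="sqrt",    # features considered at each split
    max_depth=8,            # maximum depth of each tree
    min_samples_split=20,   # minimum samples needed to split a node
    min_samples_leaf=100,   # minimum samples in a leaf node
    random_state=0,         # assumption: fixed seed for reproducibility
)
clf.fit(X, y)

predictions = clf.predict(X[:5])  # 1 = harassing call, 0 = normal call
print(predictions)
```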

G (D, A)=H (D)-H (D | A) (2)

wherein H(D) is the empirical entropy and H(D | A) is the empirical conditional entropy given the selected feature A, calculated as shown in formulas (3) and (4) respectively:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|} \quad (3)$$

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|} \quad (4)$$

Here D is the training data set and |D| is its sample size, i.e. the number of samples (elements) in D. There are K classes $C_k$, k = 1, 2, ..., K; $|C_k|$ is the number of samples in $C_k$, and the $|C_k|$ sum to |D|. Feature A divides D into n subsets $D_1, D_2, \ldots, D_n$; $|D_i|$ is the number of samples in $D_i$, and the $|D_i|$ sum to |D|, i = 1, 2, ..., n. The set of samples in $D_i$ that belong to $C_k$ is denoted $D_{ik}$, i.e. the intersection of $D_i$ and $C_k$, and $|D_{ik}|$ is the number of samples in $D_{ik}$.
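The information gain G(D, A) = H(D) - H(D | A) can be worked through on a toy example. The sketch below assumes a base-2 logarithm and a discrete feature A that partitions D by its distinct values; the function names and the toy labels are illustrative. For continuous features such as x1 to x14, decision tree implementations typically split on a threshold instead of on distinct values.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy H(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """G(D, A) = H(D) - H(D | A) for a discrete feature A."""
    total = len(labels)
    # Partition the labels into subsets D_i, one per distinct value of A.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # Empirical conditional entropy: size-weighted entropy of each subset.
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]             # two harassing calls, two normal calls
separating = ["a", "a", "b", "b"]  # feature that separates the classes exactly
uninformative = ["a", "b", "a", "b"]  # feature unrelated to the class

gain_separating = information_gain(separating, labels)
gain_uninformative = information_gain(uninformative, labels)
print(gain_separating, gain_uninformative)
```

A feature that separates the classes exactly recovers the full entropy of the labels, while a feature unrelated to the class yields no gain, which is why the feature with the largest gain is preferred for a split.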

A random forest is built from multiple decision trees. Constructing a random forest requires two parameters: the number t of decision trees, and the number m of input features considered at each node split of a decision tree. A single decision tree is built as follows:

(1) Let N be the number of training examples; the input to a single decision tree is then N training examples drawn at random, with replacement, from the training set;

(2) Let M be the number of input features of the training examples (M = 14), with m far smaller than M. When a node of a decision tree is split, m input features are randomly selected from the M input features, and the one with the maximum information gain among these m features is chosen for the split; m does not change while the decision trees are being built;

(3) Each decision tree keeps splitting in this way until all training examples at a node belong to the same class; no pruning is needed.
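The two sources of randomness in steps (1) and (2) can be sketched as follows. The helper names are illustrative, and setting m to the integer part of the square root of M is an assumption chosen to match the max_features=sqrt setting listed earlier; the source only requires m to be far smaller than M.

```python
import math
import random


def bootstrap_indices(n_examples, rng):
    """Step (1): draw n_examples row indices at random, with replacement."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]


def candidate_features(n_features, m, rng):
    """Step (2): pick m distinct feature indices to consider at one split."""
    return rng.sample(range(n_features), m)


rng = random.Random(0)
N = 10                 # illustrative number of training examples
M = 14                 # number of input features, as in the source
m = int(math.sqrt(M))  # assumption: m = floor(sqrt(M)) = 3

rows = bootstrap_indices(N, rng)
features = candidate_features(M, m, rng)
print(rows, features)
```

Because rows are drawn with replacement, some examples appear more than once in a tree's sample and others not at all, which is what makes the individual trees differ from one another.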

A decision tree is an algorithm that makes decisions using a tree structure. The sample data are branched according to known conditions, i.e. features, until a tree is built; the leaf nodes of the tree represent the final decisions, and new data can be judged according to this tree. A random forest is an algorithm that optimizes decisions using multiple decision trees: training data are randomly selected with replacement to construct each classifier, and ensemble learning over the resulting models improves the overall classification performance.

As shown in Fig. 3, in user loan repayment capacity prediction, whether a loan user is able to repay the loan is predicted from whether the customer owns a house, whether the customer is married, and the customer's average monthly income. Each internal node represents a condition test on an attribute, and each leaf node indicates whether the loan user is able to repay. When the decision tree selects a feature, the feature with the largest information gain is selected as the splitting condition of the node; the information gain values of the other features are calculated in the same way, multiple trees are formed, and finally a voting mechanism determines whether the customer is able to repay the loan.

As shown in Fig. 4, in blind date prediction, whether a girl goes on a blind date is predicted from the basic characteristics of the boy, such as character, wealth, job and appearance. Each internal node represents a condition test on an attribute, and each leaf node indicates whether the girl chooses to go on the blind date. When the decision tree selects a feature, the information gain value of each feature is first calculated and the values are sorted in descending order; the feature with the largest information gain is selected as the root node, the information gain of the other nodes is calculated, and the feature with the largest information gain is selected for the second split. Splitting is repeated in this way to form multiple trees, and finally the voting mechanism of the random forest determines whether the girl goes on the blind date.

As shown in Fig. 5, in sports prediction, whether someone will go out to play ball on a given day is predicted from given meteorological data such as humidity, wind force and the weather forecast. Humidity, wind force and the weather forecast are each taken as a candidate root node and their information gain is calculated; the weather forecast, which has the largest information gain, is selected as the root node. The information gain calculation is repeated for the child nodes, and the node with the largest information gain is chosen as the next root node, and so on until the nodes can no longer be split. Multiple trees are generated, and finally the voting mechanism of the random forest predicts whether the person goes to play ball under a given weather condition.

Multiple decision trees are generated while training the harassing call identification model. For each new test sample, the classification results of the multiple decision trees are combined, and the class that receives the most single-tree votes is taken as the classification result of the whole random forest.
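The voting step can be illustrated with a small sketch. The function name and the list of per-tree votes are illustrative; ties are resolved in favour of the class encountered first, which is a property of this sketch and not something the source specifies.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of single trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Five hypothetical single-tree outputs for one call record:
# 1 = harassing call, 0 = normal call.
votes = [1, 0, 1, 1, 0]
result = majority_vote(votes)
print(result)
```

Some library implementations, scikit-learn's `RandomForestClassifier` among them, average the per-tree class probabilities instead of counting hard votes; the two approaches usually agree but can differ on borderline samples.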

In a specific embodiment of the present invention, the method also comprises evaluating the harassing call identification model by means of the recognition-effect evaluation values, wherein the evaluation values include precision, recall and F1-score, calculated respectively as follows:

Precision = TP / (TP + FP) (5)

Recall = TP / (TP + FN) (6)

F1-score = 2 * Precision * Recall / (Precision + Recall) (7)
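Formulas (5) to (7) can be checked on a small hand-made example. The function name and the label lists below are illustrative; returning 0.0 when a denominator is zero is a convention chosen for this sketch, since the formulas themselves are undefined in that case.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive, predicted positive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative, predicted positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive, predicted negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative labels: 1 = harassing call, 0 = normal call.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this example TP = 3, FP = 1 and FN = 1, so precision and recall are both 3/4, and the F1-score, being their harmonic mean, is also 3/4.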

In one particular embodiment of the present invention, the generation module includes:

In one particular embodiment of the present invention, the first building module includes:

G (D, A)=H (D)-H (D | A) (2)

In a specific embodiment of the present invention, the system also comprises an evaluation module, for evaluating the harassing call identification model by means of the recognition-effect evaluation values, wherein the evaluation values include precision, recall and F1-score, calculated respectively as follows:

Precision = TP / (TP + FP) (5)

Recall = TP / (TP + FN) (6)

F1-score = 2 * Precision * Recall / (Precision + Recall) (7)

To facilitate understanding of the above technical solution of the present invention, it is described in detail below by way of specific modes of use.

Embodiment one

The data are user call-record data from the Communications Administration of a city in Jiangsu Province, with 14 data dimensions in total: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14. Taking a single encrypted record as an example, the values of the dimensions are -0.049, -0.059, -0.270, -0.264, -0.339, -0.079, 0.052, 0.039, 0.055, -0.052, 0.092, -0.247, -0.042, -0.057. The WEB terminal calls the API interface and the data are passed into the harassing call identification model. Inside the model, information gain is calculated with each node as a candidate root node, and the feature with the largest information gain is selected as the root node; the information gain values of the other features are calculated in the same way to form multiple trees. Finally, the random forest voting mechanism judges whether the calling number is a harassing call, and the result is returned as JSON: 1 if it is a harassing call, 0 if it is not.

As shown in Fig. 6, the root node is a data dimension, each edge is a decision condition, and each leaf node is the output target of the single tree. During prediction, each node is tested in turn according to its condition until a leaf node is reached, at which point the judgment of the single tree ends. This single tree takes x1 (-0.049) as its root node; since x1 < -0.019, the left branch is taken to test x2 (-0.059); since x2 > -0.220, the leaf node is "no", so this single tree judges the record as "no". Across the whole random forest, the proportion of trees whose leaf node is "no" is smaller than the proportion whose leaf node is "yes", so the record is finally judged to be a harassing call; the model outputs 1, and the interface returns 1 to the WEB caller.

Embodiment two

The data are user call-record data from the Communications Administration of a city in Hebei Province, with 14 data dimensions in total: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14. Taking a single encrypted record as an example, the values of the dimensions are -0.149, -0.259, -0.170, -0.364, -0.239, -0.479, 0.152, 0.239, 0.155, -0.152, 0.194, -0.127, -0.542, -0.257. The WEB terminal calls the API interface and the data are passed into the harassing call identification model. Inside the model, information gain is calculated with each node as a candidate root node, and the feature with the largest information gain is selected as the root node; the information gain values of the other features are calculated in the same way to form multiple trees. Finally, the random forest voting mechanism judges whether the calling number is a harassing call, and the result is returned as JSON: 1 if it is a harassing call, 0 if it is not.

As shown in Fig. 7, during prediction by the harassing call identification model, each node is tested in turn according to its condition until a leaf node is reached, at which point the judgment of the single tree ends. This single tree takes x3 (-0.170) as its root node; since x3 < -0.02, the left branch is taken to test x4 (-0.364); since x4 < -0.218, the left branch is taken to test x9 (0.155); since x9 > -0.335, the leaf node is "no", so this single tree judges the record as "no". Across the whole random forest, the proportion of trees whose leaf node is "no" is greater than the proportion whose leaf node is "yes", so the record is finally judged not to be a harassing call; the model outputs 0, and the interface returns 0 to the WEB caller.

In conclusion, by means of the above technical solution of the present invention, manual misjudgments and missed judgments are effectively avoided, the error of the model is reduced, and the availability and practicality of the model are improved.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A harassing call recognition method based on the random forest algorithm, characterized by comprising the following steps:

constructing a harassing call identification model from the new sample set, and initializing the random forest parameters in the harassing call identification model;

training the harassing call identification model, verifying it using ten-fold cross validation, and calculating its recognition-effect evaluation values;

obtaining the optimal harassing call identification model, using pickle to serialize and then deserialize the optimal model, constructing an API, and deploying the optimal model online as an interface;

when a call record to be tested arrives, calling the API interface and inputting the data into the optimal harassing call identification model for prediction.

2. The harassing call recognition method based on the random forest algorithm according to claim 1, characterized in that generating the new sample set by over-sampling the raw data set using the SMOTE algorithm comprises:

analyzing the minority-class harassing call samples, and adding new samples synthesized from those samples to the raw data set;

3. The harassing call recognition method based on the random forest algorithm according to claim 1, characterized in that constructing the harassing call identification model from the new sample set and initializing the random forest parameters in the model comprises:

setting the random forest parameters, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree;

G(D, A) = H(D) - H(D | A)

constructing the random forest according to the random forest parameters and the information gain values, and training multiple decision trees with the random forest algorithm to generate the harassing call identification model.

4. The harassing call recognition method based on the random forest algorithm according to any one of claims 1-3, characterized in that the method also comprises evaluating the harassing call identification model by means of the recognition-effect evaluation values, wherein the evaluation values include precision, recall and F1-score, calculated respectively as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 * Precision * Recall / (Precision + Recall)

wherein TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative but predicted positive, and FN is the number of samples that are positive but predicted negative.

5. A harassing call identification system based on the random forest algorithm, characterized by comprising:

a first building module, for constructing a harassing call identification model from the new sample set and initializing the random forest parameters in the harassing call identification model;

a verification module, for verifying the trained harassing call identification model using ten-fold cross validation and calculating its recognition-effect evaluation values;

an obtaining module, for obtaining the optimal harassing call identification model, using pickle to serialize and then deserialize the optimal model, constructing an API, and deploying the optimal model online as an interface;

6. The harassing call identification system based on the random forest algorithm according to claim 5, characterized in that the generation module comprises:

an analysis module, for analyzing the minority-class harassing call samples and adding new samples synthesized from those samples to the raw data set;

a first selection module, for randomly selecting, for each minority-class harassing call sample, several first samples from its nearest neighbors;

7. The harassing call identification system based on the random forest algorithm according to claim 5, characterized in that the first building module comprises:

a parameter setting module, for setting the random forest parameters, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree;

a first computing module, for calculating the information gain of each attribute and choosing the most suitable node, repeating the information gain calculation for the child nodes and choosing the node with the maximum information gain, and so on, to generate multiple trees, wherein the information gain is calculated as follows:

G(D, A) = H(D) - H(D | A)

a second building module, for constructing the random forest according to the random forest parameters and the information gain values, and training multiple decision trees with the random forest algorithm to generate the harassing call identification model.

8. The harassing call identification system based on the random forest algorithm according to any one of claims 5-7, characterized in that the system also comprises an evaluation module, for evaluating the harassing call identification model by means of the recognition-effect evaluation values, wherein the evaluation values include precision, recall and F1-score, calculated respectively as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 * Precision * Recall / (Precision + Recall)