CN110147430A - Harassing call recognition method and system based on a random forest algorithm - Google Patents

Harassing call recognition method and system based on a random forest algorithm

Info

Publication number
CN110147430A
Authority
CN
China
Prior art keywords
harassing call
identification model
sample
call identification
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910339683.6A
Other languages
Chinese (zh)
Inventor
周红敏
祝敬安
王红熳
韦红
丁正
顾晓东
张飞
贾岩峰
刘艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI XINFANG SOFTWARE Co Ltd
BEIJING XINFANG INTELLIGENT SYSTEM CO LTD
Original Assignee
SHANGHAI XINFANG SOFTWARE Co Ltd
BEIJING XINFANG INTELLIGENT SYSTEM CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI XINFANG SOFTWARE Co Ltd, BEIJING XINFANG INTELLIGENT SYSTEM CO LTD
Priority to CN201910339683.6A
Publication of CN110147430A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a harassing call recognition method and system based on a random forest algorithm. A new sample set is generated by over-sampling the raw data set with the SMOTE algorithm; a harassing call identification model is constructed from the new sample set, and the random forest parameters in the model are initialized; the model is trained and then verified using ten-fold cross-validation, and its recognition-effect evaluation values are calculated; the optimal harassing call identification model is obtained, serialized and then deserialized using pickle, and an API is built so that the optimal model is deployed online as an interface; when a call record to be tested arrives, the API interface is called and the data is input into the optimal model for prediction. The beneficial effects of the invention are that manual misjudgments and missed judgments are effectively avoided, the error of the model is reduced, and the availability and practicality of the model are improved.

Description

Harassing call recognition method and system based on a random forest algorithm
Technical field
The present invention relates to the field of natural language processing, and in particular to a harassing call recognition method and system based on a random forest algorithm.
Background
A harassing call is a nuisance call made deliberately to promote products or to impersonate the police, bank staff and the like. Harassing calls are highly disruptive, enticing and deceptive, and are easy to disguise; the numbers are dialed frequently and such cases are hard to solve, which seriously harms people's normal life and personal privacy.
No effective solution has yet been proposed for this problem in the related art.
Summary of the invention
In view of the above technical problem in the related art, the present invention proposes a harassing call recognition method and system based on a random forest algorithm, which can identify harassing calls quickly and effectively and addresses the problem of harassing calls being misjudged or missed by manual review.
To achieve the above technical purpose, the technical solution of the present invention is realized as follows:
A harassing call recognition method based on a random forest algorithm, comprising the following steps:
Processing the raw data set and determining the distribution ratio of positive and negative harassing call samples;
For the imbalanced harassing call samples, over-sampling the raw data set using the SMOTE algorithm to generate a new sample set and balance the data distribution;
Constructing a harassing call identification model from the new sample set, initializing the random forest parameters in the model, and setting the input and output variables of the random forest;
Training the harassing call identification model, verifying it using ten-fold cross-validation, and calculating its recognition-effect evaluation values;
Obtaining the optimal harassing call identification model by grid search, which improves the precision of the random forest and the training efficiency; serializing and then deserializing the optimal model using pickle, building an API, and deploying the optimal model online as an interface;
When a call record to be tested arrives, calling the API interface and inputting the data into the optimal harassing call identification model for prediction.
Further, generating the new sample set by over-sampling the raw data set using the SMOTE algorithm includes:
Analyzing the minority-class harassing call samples, and adding new samples artificially synthesized from these samples to the raw data set;
For each minority-class harassing call sample, randomly selecting several first samples from its nearest neighbours;
Randomly selecting a second sample on the line connecting the harassing call sample and the first sample.
Further, constructing the harassing call identification model from the new sample set and initializing the random forest parameters in the model includes:
Setting the random forest parameters, wherein the random forest parameters include the number of decision trees, whether sampling with replacement is used, the information gain criterion, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree;
Calculating the information gain of the attributes and choosing the most suitable node; the child nodes repeat the information gain calculation and choose the node with the maximum information gain, and so on, generating multiple trees. The information gain is calculated as follows:
G(D, A) = H(D) - H(D|A)
Wherein H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy given the selected feature A;
Constructing the random forest according to the random forest parameters and the information gain values, and training multiple decision trees with the random forest algorithm to generate the harassing call identification model.
Further, the method also includes evaluating the harassing call identification model by the recognition-effect evaluation values, wherein the recognition-effect evaluation values include precision, recall and F1-score, calculated respectively as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 * Precision * Recall / (Precision + Recall)
Wherein TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative and predicted positive, and FN is the number of samples that are positive and predicted negative.
As shown in Fig. 2, another aspect of the present invention provides a harassing call identification system based on a random forest algorithm, comprising:
A determining module, configured to process the raw data set and determine the distribution ratio of positive and negative harassing call samples;
A generation module, configured to over-sample the raw data set using the SMOTE algorithm to generate a new sample set;
A first construction module, configured to construct a harassing call identification model from the new sample set and to initialize the random forest parameters in the model;
A verification module, configured to verify the trained harassing call identification model using ten-fold cross-validation and to calculate its recognition-effect evaluation values;
An obtaining module, configured to obtain the optimal harassing call identification model, serialize and then deserialize it using pickle, build an API, and deploy the optimal model online as an interface;
An identification module, configured to call the API interface when a call record to be tested arrives and to input the data into the optimal harassing call identification model for prediction.
Further, the generation module includes:
An analysis module, configured to analyze the minority-class harassing call samples and to add new samples artificially synthesized from these samples to the raw data set;
A first selection module, configured to randomly select, for each minority-class harassing call sample, several first samples from its nearest neighbours;
A second selection module, configured to randomly select a second sample on the line connecting the harassing call sample and the first sample.
Further, the first construction module includes:
A parameter setting module, configured to set the random forest parameters, wherein the random forest parameters include the number of decision trees, whether sampling with replacement is used, the information gain criterion, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree;
A first calculation module, configured to calculate the information gain of the attributes and choose the most suitable node; the child nodes repeat the information gain calculation and choose the node with the maximum information gain, and so on, generating multiple trees. The information gain is calculated as follows:
G(D, A) = H(D) - H(D|A)
Wherein H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy given the selected feature A;
A second construction module, configured to construct the random forest according to the random forest parameters and the information gain values, and to train multiple decision trees with the random forest algorithm to generate the harassing call identification model.
Further, the system also includes an evaluation module, configured to evaluate the harassing call identification model by the recognition-effect evaluation values, wherein the recognition-effect evaluation values include precision, recall and F1-score, calculated respectively as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 * Precision * Recall / (Precision + Recall)
Wherein TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative and predicted positive, and FN is the number of samples that are positive and predicted negative.
Beneficial effects of the present invention: manual misjudgments and missed judgments are effectively avoided, the error of the model is reduced, and the availability and practicality of the model are improved.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the harassing call recognition method based on a random forest algorithm according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the harassing call identification system based on a random forest algorithm according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a single-tree structure for user repayment-capability prediction according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a single-tree structure for blind date prediction according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a single-tree structure for sports prediction according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a single-tree structure of the harassing call random forest for a city in Jiangsu Province according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a single-tree structure of the harassing call random forest for a city in Hebei Province according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention fall within the protection scope of the present invention.
As shown in Fig. 1, a harassing call recognition method based on a random forest algorithm according to an embodiment of the present invention comprises the following steps:
Processing the raw data set and determining the distribution ratio of positive and negative harassing call samples;
Data set D is one month of call records (encrypted). The raw data has 49 dimensions; after data cleansing, variable derivation and feature selection, the random forest input variables x1, x2, x3, ..., x14 are obtained. The variables are described as follows:
x1: at the time point where the number of calls is at its maximum, the total number of call tickets dialed by the calling number;
x2: at the time point where the number of calls is at its maximum, the total number of different called numbers dialed by the calling number;
x3: at the time point where the number of calls is at its maximum, the total call duration of the calling number;
x4: at the time point where the number of calls is at its maximum, the maximum call duration of the calling number;
x5: at the time point where the number of calls is at its maximum, the mean call duration of the calling number;
x6: at the time point where the number of calls is at its maximum, the standard deviation of the call duration of the calling number;
x7: at the time point where the number of calls is at its maximum, the number of call tickets of the calling number with a call duration of 0;
x8: at the time point where the number of calls is at its maximum, the number of call tickets of the calling number with a non-zero call duration;
x9: at the time point where the number of calls is at its maximum, the number of numbers on the call tickets of the calling number with a non-zero call duration;
x10: at the time point where the number of calls is at its maximum, the total number of local calls of the calling number;
x11: at the time point where the number of calls is at its maximum, the number of non-local calls of the calling number;
x12: at the time point where the number of calls is at its maximum, the maximum ringing duration of the calling number;
x13: at the time point where the number of calls is at its maximum, the mean ringing duration of the calling number;
x14: at the time point where the number of calls is at its maximum, the standard deviation of the ringing duration of the calling number.
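A minimal sketch of how a few of the duration-based variables (x1 to x8) might be derived for one calling number. The record layout (`callee`, `duration`) is a hypothetical assumption, the records are assumed to be already restricted to the time point with the maximum number of calls, and the population form of the standard deviation is an assumption because the source does not say which form is used.

```python
import statistics

def caller_features(records):
    """Compute x1..x8-style statistics for one calling number.

    `records` is a list of dicts with hypothetical keys 'callee' and
    'duration' (seconds); these field names are not from the source.
    """
    durations = [r["duration"] for r in records]
    return {
        "x1_total_tickets": len(records),
        "x2_distinct_callees": len({r["callee"] for r in records}),
        "x3_total_duration": sum(durations),
        "x4_max_duration": max(durations),
        "x5_mean_duration": statistics.mean(durations),
        # Population standard deviation (assumption, see lead-in).
        "x6_std_duration": statistics.pstdev(durations),
        "x7_zero_duration_tickets": sum(1 for d in durations if d == 0),
        "x8_nonzero_duration_tickets": sum(1 for d in durations if d != 0),
    }

# Three made-up call tickets for a single calling number.
records = [
    {"callee": "A", "duration": 0},
    {"callee": "B", "duration": 30},
    {"callee": "A", "duration": 60},
]
features = caller_features(records)
```

The ringing-duration variables x12 to x14 could presumably be computed the same way from a ringing-time field.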
Over-sampling the raw data set using the SMOTE algorithm to generate a new sample set;
Constructing a harassing call identification model from the new sample set, and initializing the random forest parameters in the model;
Training the harassing call identification model, verifying it using ten-fold cross-validation, and calculating its recognition-effect evaluation values;
Different values are set for each of the random forest parameters; using ten-fold cross-validation and grid search, the data are fitted and the harassing call identification model is trained repeatedly, and the evaluation parameters of each trained model are output.
Obtaining the optimal harassing call identification model, serializing and then deserializing it using pickle, building an API, and deploying the optimal model online as an interface;
When a call record to be tested arrives, calling the API interface and inputting the data into the optimal harassing call identification model for prediction.
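The combination of ten-fold cross-validation and grid search described in the steps above can be sketched as follows. This is an illustrative outline only: `evaluate` is a caller-supplied placeholder for "train a forest on the training fold and score it on the test fold", and the toy scoring function and the candidate values in `grid` are made up for the example.

```python
import itertools

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def grid_search(param_grid, evaluate, n_samples, k=10):
    """Return (best_params, best_score); the score is the mean over k folds."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, tr, te) for tr, te in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy scoring function standing in for "train the model and compute F1".
def toy_evaluate(params, train_idx, test_idx):
    return 1.0 - abs(params["max_depth"] - 8) / 10 - abs(params["n_estimators"] - 60) / 1000

grid = {"n_estimators": [30, 60, 90], "max_depth": [4, 8, 12]}
best_params, best_score = grid_search(grid, toy_evaluate, n_samples=100)
```

Every parameter combination is scored on the same ten folds, so the mean scores are directly comparable.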
Specifically, the original call detail record (CDR) data set (encrypted) is processed and the distribution ratio of positive and negative samples is determined. The raw data set is over-sampled with the SMOTE algorithm to form the final new sample set, which is split into a training set and a test set. The random forest parameters in the harassing call identification model are initialized. The model is trained using ten-fold cross-validation and verified on the test set, and its precision, recall and F1 score are calculated. The optimal harassing call identification model is obtained by grid search, serialized using pickle, and saved to the server. The model is then deserialized using pickle and an API is built with the Flask framework; the model that meets the business requirements is deployed online as an API interface, realizing short-term, real-time prediction and interception of fraud calls. When a call record to be tested arrives, the API interface is called and the data is input into the harassing call identification model to predict whether the call is a harassing call. After the model has made its prediction, the result is returned: the API interface returns 1 for a harassing call and 0 for a normal call.
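The serialize-then-deserialize step can be illustrated with Python's standard `pickle` module. In this sketch a plain dictionary stands in for the trained model so that the example needs no machine learning library; `stub_model` and `predict` are hypothetical names, and the threshold rule is not the patent's classifier.

```python
import pickle

# Stand-in for a trained model (assumption): a plain dict of builtin types.
# A real deployment would pickle the fitted random forest object instead.
stub_model = {"feature_index": 0, "threshold": 0.5}

def predict(model, features):
    """Return 1 (harassing call) or 0 (normal call) using the stand-in rule."""
    return 1 if features[model["feature_index"]] > model["threshold"] else 0

# Serialize: these bytes are what would be saved to a file on the server.
payload = pickle.dumps(stub_model)

# Deserialize: the API process would load the bytes back before serving requests.
restored = pickle.loads(payload)

label = predict(restored, [0.9, 0.1])
```

Because unpickling can execute arbitrary code, the serialized model should only be loaded from a trusted location.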
In a specific embodiment of the present invention, generating the new sample set by over-sampling the raw data set using the SMOTE algorithm includes:
Analyzing the minority-class harassing call samples, and adding new samples artificially synthesized from these samples to the raw data set;
For each minority-class harassing call sample, randomly selecting several first samples from its nearest neighbours;
Randomly selecting a second sample on the line connecting the harassing call sample and the first sample.
That is, the minority-class harassing call samples A are analyzed, and new samples artificially synthesized from the minority-class samples are added to the raw data set. For each minority-class harassing call sample a, a first sample b is randomly selected from its nearest neighbours, and a point on the line connecting a and b is then randomly selected as the newly synthesized minority-class sample (the second sample). The specific algorithm steps are as follows:
(1) For each sample x in the minority class, the distance from x to every sample in the minority-class sample set is calculated using the Euclidean distance, and its k nearest neighbours are obtained. The Euclidean distance d is calculated as shown in formula (1):
d = \sqrt{\sum_{i=1}^{N} (x_{1i} - x_{2i})^2}  (1)
Wherein N is the dimension of the sample data, x_{1i} is the i-th dimension of the first sample, and x_{2i} is the i-th dimension of the second sample.
(2) For each minority-class sample x, several samples are randomly selected from its k nearest neighbours; let a selected neighbour be xn;
(3) For each randomly selected neighbour xn, random linear interpolation is performed between xn and the original sample x to construct a new sample;
(4) The new samples are added to the original data to produce a new training set; after SMOTE sampling, the final new sample set is formed, wherein the new sample set includes training samples and test samples.
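Steps (1) to (3) above can be sketched in plain Python as follows. This is a simplified illustration rather than a reference SMOTE implementation: the function names, the default `k`, the number of synthetic samples per original sample and the fixed seed are all assumptions, and the minority set is assumed to contain at least two samples.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two samples of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def smote(minority, k=5, n_new_per_sample=1, seed=0):
    """Return synthetic minority samples built by random linear interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # Step (1): k nearest neighbours of x among the other minority samples.
        others = [s for s in minority if s is not x]
        neighbours = sorted(others, key=lambda s: euclidean(x, s))[:k]
        for _ in range(n_new_per_sample):
            # Step (2): pick one neighbour xn at random.
            xn = rng.choice(neighbours)
            # Step (3): pick a random point on the segment between x and xn.
            gap = rng.random()
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, xn)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, k=2)
```

Step (4) then amounts to appending `new_samples` to the original data before splitting it into training and test sets.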
In a specific embodiment of the present invention, constructing the harassing call identification model from the new sample set and initializing the random forest parameters in the model includes:
Setting the random forest parameters, wherein the random forest parameters include the number of decision trees, whether sampling with replacement is used, the information gain criterion, the number of features considered when selecting the most suitable attribute for a split, and the maximum depth of a tree. Information gain is the index used for feature selection in the decision tree model: the larger the information gain of a feature, the better that feature is as a choice for splitting.
The random forest parameters and their meanings are as follows:
n_estimators=60: the number of decision trees;
bootstrap=True: sampling with replacement;
criterion=entropy: the information gain of the attributes is calculated in order to select the most suitable node;
max_features=sqrt: the number of features considered when selecting the most suitable attribute for a split does not exceed this value;
max_depth=8: the maximum depth of a tree;
min_samples_split=20: the minimum number of samples required to split a node according to an attribute;
min_samples_leaf=100: the minimum number of samples at a leaf node;
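For reference, the parameter list above can be written as a Python dictionary of keyword arguments. The names coincide with arguments of scikit-learn's `RandomForestClassifier`; the source does not name the library, so that mapping is an assumption and is shown only as a comment.

```python
# Parameter names and values as listed in the text.
rf_params = {
    "n_estimators": 60,        # number of decision trees
    "bootstrap": True,         # sample with replacement
    "criterion": "entropy",    # split quality measured by information gain
    "max_features": "sqrt",    # features considered at each split
    "max_depth": 8,            # maximum depth of a tree
    "min_samples_split": 20,   # minimum samples needed to split a node
    "min_samples_leaf": 100,   # minimum samples at a leaf node
}

# Assumption: with scikit-learn installed, one possible use would be
#   from sklearn.ensemble import RandomForestClassifier
#   model = RandomForestClassifier(**rf_params)
```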
The information gain of the attributes is calculated and the most suitable node is chosen; the child nodes repeat the information gain calculation and choose the node with the maximum information gain, and so on, generating multiple trees. The information gain is calculated as follows:
G(D, A) = H(D) - H(D|A)  (2)
Wherein H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy given the selected feature A, calculated as shown in formula (3) and formula (4) respectively:
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}  (3)
H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}  (4)
Here D is the training data set and |D| is its sample size, i.e. the number of samples (elements) in D. There are K classes C_k, k = 1, 2, ..., K; |C_k| is the number of samples belonging to class C_k, and the sum of the |C_k| is |D|. According to feature A, D is divided into n subsets D_1, D_2, ..., D_n; |D_i| is the number of samples in D_i, and the sum of the |D_i| is |D|, i = 1, 2, ..., n. The set of samples in D_i that belong to class C_k is denoted D_ik, i.e. the intersection of D_i and C_k, and |D_ik| is the number of samples in D_ik.
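The quantities defined above can be checked with a small sketch for a single discrete feature. The base-2 logarithm is an assumption, and the continuous variables x1 to x14 would first have to be turned into threshold tests, which this sketch does not cover.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """G(D, A) = H(D) - H(D|A) for one discrete feature A."""
    total = len(labels)
    # Group the labels by feature value: one label list per subset D_i.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # H(D|A) is the size-weighted average of the subset entropies.
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional

labels = [1, 1, 0, 0]
separating = information_gain(["a", "a", "b", "b"], labels)     # feature splits the classes cleanly
uninformative = information_gain(["a", "b", "a", "b"], labels)  # feature says nothing about the class
```

A feature that separates the two classes completely recovers the full entropy of D as its gain, while a feature that is independent of the class has a gain of zero.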
The random forest is constructed according to the random forest parameters and the information gain values, and multiple decision trees are trained with the random forest algorithm to generate the harassing call identification model.
A random forest is built from multiple decision trees. Two parameters are needed when constructing a random forest: the number t of decision trees, and the number m of input features to be considered at each node split of a decision tree. A single decision tree is built as follows:
(1) Let N be the number of training examples; the input samples of a single decision tree are N training examples drawn at random, with replacement, from the training set;
(2) Let M be the number of input features of the training examples (M=14), with m far smaller than M. When each node of a decision tree is split, m input features are randomly selected from the M input features, and the one with the maximum information gain among these m features is chosen for the split; m does not change while the decision tree is being built;
(3) Each decision tree keeps splitting in this way until all training examples at a node belong to the same class; no pruning is needed.
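The two sources of randomness in steps (1) and (2) can be sketched with the standard `random` module. The values of `N` and `m` and the seed are illustrative assumptions; only M=14 comes from the text.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Step (1): draw n_samples indices at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_features(n_features, m, rng):
    """Step (2): pick m distinct feature indices to consider at one node split."""
    return rng.sample(range(n_features), m)

rng = random.Random(42)
N, M, m = 1000, 14, 3   # N and m are illustrative; M = 14 as in the text
sample_idx = bootstrap_sample(N, rng)
split_features = candidate_features(M, m, rng)
```

Because the draw is with replacement, some training examples appear several times in `sample_idx` and others not at all, which is what makes the individual trees differ.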
A decision tree is an algorithm that makes decisions using a tree structure. The sample data is branched according to known conditions, called features, until a tree is built; the leaf nodes of the tree represent the final decisions, and new data can be judged according to this tree. A random forest is an algorithm that improves the decision by using multiple decision trees: training data is selected at random with replacement to build each classifier, and the classifiers are finally combined by ensemble learning to improve the overall classification performance.
As shown in Fig. 3, in user repayment-capability prediction, whether a loan user is able to repay the loan is predicted from whether the client owns a house, whether the client is married, and the client's average monthly income. Each internal node represents a test on an attribute, and each leaf node indicates whether the loan user is able to repay. When the decision tree selects a feature, the feature with the largest information gain is chosen as the splitting condition of that node; the information gains of the other features are calculated in the same way, multiple trees are formed, and finally a voting mechanism judges whether the client is able to repay the loan.
As shown in Fig. 4, in blind date prediction, whether a girl will go on a blind date is predicted from the basic characteristics of the boy, such as character, wealth, job and appearance. Each internal node represents a test on an attribute, and each leaf node indicates whether the girl chooses to go on the blind date. When the decision tree selects a feature, the information gain of each feature is first calculated and the features are sorted in descending order of information gain; the feature with the largest information gain is chosen as the root node, the information gains of the other nodes are calculated, and the feature with the largest information gain is chosen for the second split. Splitting is repeated in this way, multiple trees are formed, and finally the voting mechanism of the random forest gives whether the girl goes on the blind date.
As shown in Fig. 5, in sports prediction, whether someone will go out to play ball on a given day is predicted from given meteorological data such as humidity, wind force and the weather forecast. Humidity, wind force and the weather forecast are each taken as the root node in turn and their information gains are calculated; the weather forecast, which has the largest information gain, is chosen as the root node. The child nodes repeat the information gain calculation, and the node with the largest information gain is chosen as the next root node, and so on until the nodes cannot be split further. Multiple trees are generated, and finally the voting mechanism of the random forest predicts whether to go and play ball under a given weather condition.
Multiple decision trees are generated while the harassing call identification model is trained. For each new test sample, the classification results of the decision trees are combined, and the class predicted by the largest number of individual trees is taken as the classification result of the whole random forest.
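The voting rule described above amounts to taking the most frequent label among the individual tree predictions. In this minimal sketch the list of votes is made up; the source does not specify how ties are broken, so the behaviour of `Counter.most_common` (first label encountered wins a tie) is an assumption.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees: 1 = harassing call, 0 = normal call.
votes = [1, 0, 1, 1, 0]
forest_label = majority_vote(votes)
```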
In a specific embodiment of the present invention, the method also includes evaluating the harassing call identification model by the recognition-effect evaluation values, wherein the recognition-effect evaluation values include precision, recall and F1-score, calculated respectively as follows:
Precision = TP / (TP + FP)  (5)
Recall = TP / (TP + FN)  (6)
F1-score = 2 * Precision * Recall / (Precision + Recall)  (7)
Wherein TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative and predicted positive, and FN is the number of samples that are positive and predicted negative.
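Formulas (5) to (7) can be computed directly from the true and predicted labels. In this sketch the labels are made up, 1 denotes a harassing call, and returning 0.0 when a denominator is zero is a convention chosen here rather than something stated in the source.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP and FN, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1-score) following formulas (5) to (7)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here TP = 2, FP = 1 and FN = 1, so precision, recall and F1-score all equal 2/3.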
As shown in Fig. 2, another aspect of the present invention provides a harassing call identification system based on a random forest algorithm, comprising:
A determining module, configured to process the raw data set and determine the distribution ratio of positive and negative harassing call samples;
A generation module, configured to over-sample the raw data set using the SMOTE algorithm to generate a new sample set;
A first construction module, configured to construct a harassing call identification model from the new sample set and to initialize the random forest parameters in the model;
A verification module, configured to verify the trained harassing call identification model using ten-fold cross-validation and to calculate its recognition-effect evaluation values;
An obtaining module, configured to obtain the optimal harassing call identification model, serialize and then deserialize it using pickle, build an API, and deploy the optimal model online as an interface;
An identification module, configured to call the API interface when a call record to be tested arrives and to input the data into the optimal harassing call identification model for prediction.
In a specific embodiment of the present invention, the generation module includes:
An analysis module, configured to analyze the minority-class harassing call samples and to add new samples artificially synthesized from these samples to the raw data set;
A first selection module, configured to randomly select, for each minority-class harassing call sample, several first samples from its nearest neighbours;
A second selection module, configured to randomly select a second sample on the line connecting the harassing call sample and the first sample.
In one particular embodiment of the present invention, the first building module includes:
A parameter setting module, configured to set the random forest parameters, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the features used for splitting when selecting the most suitable attribute, and the maximum depth of the trees;
A first computing module, configured to calculate the information gain of each attribute and select the most suitable node; the information gain is recomputed at each child node and the node with the maximum information gain is selected, and so on, to generate multiple trees. The information gain is calculated as follows:
G(D, A) = H(D) - H(D|A)    (2)
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy given the selected feature A;
A second building module, configured to construct the random forest according to the random forest parameters and the information gain values, and to train multiple decision trees using the random forest algorithm to generate the harassing call identification model.
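Formula (2) can be worked through with a small sketch. The example below assumes discrete feature values and toy labels for simplicity (the call-record features in this document are continuous and would first need a split threshold); the function names and data are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy H(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """Empirical conditional entropy H(D|A) for a discrete feature A."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(feature_values, labels):
    """G(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


# Toy data: one feature separates the classes perfectly, the other not at all.
labels = [1, 1, 0, 0]
perfect = ["a", "a", "b", "b"]
useless = ["a", "b", "a", "b"]
gains = {
    "perfect": information_gain(perfect, labels),
    "useless": information_gain(useless, labels),
}
best_feature = max(gains, key=gains.get)  # the feature chosen for the split
```

The feature with the largest gain is the one selected for the node; repeating the same computation on each child's subset of the data grows the tree.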
In one particular embodiment of the present invention, the system further includes an evaluation module, configured to evaluate the harassing call identification model using the recognition-effect evaluation value, wherein the recognition-effect evaluation value includes precision, recall, and F1-score, calculated respectively as follows:
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
F1-score = 2 * Precision * Recall / (Precision + Recall)    (7)
where TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative but predicted positive, and FN is the number of samples that are positive but predicted negative.
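Formulas (5) to (7) can be checked with a short worked example. The labels and predictions below are made-up values for illustration, and the zero-denominator guards are an added assumption not stated in this document.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1-score, returning 0.0 when a denominator is zero."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels: 1 = harassing call, 0 = normal call.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here TP = 3, FP = 1, and FN = 1, so precision, recall, and F1-score are all 0.75.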
To facilitate understanding of the above technical solution of the present invention, it is described in detail below by way of specific usage examples.
Embodiment one
The data are user call-detail records from the Communications Administration Bureau of a city in Jiangsu Province. The data have 14 dimensions, x1 through x14. Taking a single encrypted record as an example, the values of the dimensions are -0.049, -0.059, -0.270, -0.264, -0.339, -0.079, 0.052, 0.039, 0.055, -0.052, 0.092, -0.247, -0.042, -0.057. The WEB terminal calls the API interface, and the data are sent to the harassing call identification model. Inside the model, the information gain is calculated with each feature as a candidate root node, the feature with the largest information gain value is selected as the root node, and the information gain values of the remaining features are then calculated in the same way to form multiple trees. Finally, the random forest voting mechanism judges whether the calling number is a harassing call, and the result is returned as JSON: 1 if it is a harassing call, 0 if it is not.
As shown in Fig. 6, the root nodes are the data dimensions, the connecting lines are the judgment conditions, and the leaf nodes are the output targets of a single tree. During prediction, each corresponding node is judged in turn according to its condition until the last leaf node is reached, at which point the judgment of the single tree ends. This single tree takes x1 (-0.049) as its root node; since x1 < -0.019, the left branch judges x2 (-0.059); since x2 > -0.220, the leaf node is "no", so the single tree judges this record as "no". Across the whole random forest, the proportion of leaf nodes that are "no" is less than the proportion that are "yes", so the record is finally judged to be a harassing call; the model outputs 1, and the interface returns 1 to the WEB caller.
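The single-tree walk and the forest vote described for Fig. 6 can be sketched as follows. Only the path x1 < -0.019, then x2 > -0.220, then "no" comes from the text; the other branches of the tree and the votes of the other trees are assumptions added so that the example runs, and the dictionary layout is illustrative.

```python
from collections import Counter

# Path described for Fig. 6; branches marked "assumed" are not given in the text.
FIG6_TREE = {
    "feature": "x1",
    "threshold": -0.019,
    "left": {
        "feature": "x2",
        "threshold": -0.220,
        "left": "yes",   # assumed leaf
        "right": "no",   # x2 > -0.220 leads to "no" (from the text)
    },
    "right": "yes",      # assumed leaf
}


def predict_tree(node, record):
    """Walk from the root to a leaf; go left when the value is below the threshold."""
    while isinstance(node, dict):
        branch = "left" if record[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node


def forest_vote(votes):
    """Return 1 (harassing call) if "yes" votes outnumber "no" votes, else 0."""
    counts = Counter(votes)
    return 1 if counts["yes"] > counts["no"] else 0


record = {"x1": -0.049, "x2": -0.059}
single_tree_vote = predict_tree(FIG6_TREE, record)
# The votes of the other trees are assumed here for illustration only.
result = forest_vote([single_tree_vote, "yes", "yes"])
```

The single tree votes "no", yet the assumed majority of "yes" votes makes the forest output 1, which mirrors the outcome described for this record.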
Embodiment two
The data are user call-detail records from the Communications Administration Bureau of a city in Hebei Province. The data have 14 dimensions, x1 through x14. Taking a single encrypted record as an example, the values of the dimensions are -0.149, -0.259, -0.170, -0.364, -0.239, -0.479, 0.152, 0.239, 0.155, -0.152, 0.194, -0.127, -0.542, -0.257. The WEB terminal calls the API interface, and the data are sent to the harassing call identification model. Inside the model, the information gain is calculated with each feature as a candidate root node, the feature with the largest information gain value is selected as the root node, and the information gain values of the remaining features are then calculated in the same way to form multiple trees. Finally, the random forest voting mechanism judges whether the calling number is a harassing call, and the result is returned as JSON: 1 if it is a harassing call, 0 if it is not.
As shown in Fig. 7, during prediction by the harassing call identification model, each corresponding node is judged in turn according to its condition until the last leaf node is reached, at which point the judgment of the single tree ends. This single tree takes x3 (-0.170) as its root node; since x3 < -0.02, the left branch judges x4 (-0.364); since x4 < -0.218, the left branch judges x9 (0.155); since x9 > -0.335, the leaf node is "no", so the single tree judges this record as "no". Across the whole random forest, the proportion of leaf nodes that are "no" is greater than the proportion that are "yes", so the record is finally judged not to be a harassing call; the model outputs 0, and the interface returns 0 to the WEB caller.
In conclusion behavior of manually judging by accident and fail to judge not only effectively is avoided by means of above-mentioned technical proposal of the invention, and And the error of model is reduced, meanwhile, improve the availability and practicability of model.
The foregoing is merely illustrative of the preferred embodiments of the present invention, is not intended to limit the invention, all in essence of the invention Within mind and principle, any modification, equivalent replacement, improvement and so on be should all be included in the protection scope of the present invention.

Claims (8)

1. A harassing call identification method based on the random forest algorithm, characterized by comprising the following steps:
processing a raw data set and determining the distribution ratio of positive and negative harassing call samples;
oversampling the raw data set using the SMOTE algorithm to generate a new sample set;
constructing a harassing call identification model from the new sample set, and initializing the random forest parameters in the harassing call identification model;
training the harassing call identification model, verifying it using ten-fold cross-validation, and calculating its recognition-effect evaluation value;
obtaining the optimal harassing call identification model, serializing and then deserializing the optimal harassing call identification model using pickle, constructing an API, and deploying the optimal harassing call identification model online as an interface;
when a call record to be tested arrives, calling the API interface and inputting the data into the optimal harassing call identification model for prediction.
2. The harassing call identification method based on the random forest algorithm according to claim 1, characterized in that oversampling the raw data set using the SMOTE algorithm to generate a new sample set comprises:
analyzing the minority-class harassing call samples, and adding new samples artificially synthesized from those harassing call samples to the raw data set;
for each minority-class harassing call sample, randomly selecting several first samples from its nearest neighbors;
randomly selecting a second sample on the line connecting the harassing call sample and the first sample.
3. The harassing call identification method based on the random forest algorithm according to claim 1, characterized in that constructing a harassing call identification model from the new sample set and initializing the random forest parameters in the harassing call identification model comprises:
setting the random forest parameters, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the features used for splitting when selecting the most suitable attribute, and the maximum depth of the trees;
calculating the information gain of each attribute and selecting the most suitable node, recomputing the information gain at each child node and selecting the node with the maximum information gain, and so on, to generate multiple trees, wherein the information gain is calculated as follows:
G(D, A) = H(D) - H(D|A)
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy given the selected feature A;
constructing the random forest according to the random forest parameters and the information gain values, and training multiple decision trees using the random forest algorithm to generate the harassing call identification model.
4. The harassing call identification method based on the random forest algorithm according to any one of claims 1-3, characterized in that the method further comprises evaluating the harassing call identification model using the recognition-effect evaluation value, wherein the recognition-effect evaluation value includes precision, recall, and F1-score, calculated respectively as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 * Precision * Recall / (Precision + Recall)
where TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative but predicted positive, and FN is the number of samples that are positive but predicted negative.
5. A harassing call identification system based on the random forest algorithm, characterized by comprising:
a determining module, configured to process a raw data set and determine the distribution ratio of positive and negative harassing call samples;
a generation module, configured to oversample the raw data set using the SMOTE algorithm to generate a new sample set;
a first building module, configured to construct a harassing call identification model from the new sample set, and to initialize the random forest parameters in the harassing call identification model;
a verification module, configured to train the harassing call identification model, verify it using ten-fold cross-validation, and calculate its recognition-effect evaluation value;
an obtaining module, configured to obtain the optimal harassing call identification model, serialize and then deserialize the optimal harassing call identification model using pickle, construct an API, and deploy the optimal harassing call identification model online as an interface;
an identification module, configured to call the API interface when a call record to be tested arrives, and to input the data into the optimal harassing call identification model for prediction.
6. The harassing call identification system based on the random forest algorithm according to claim 5, characterized in that the generation module comprises:
an analysis module, configured to analyze the minority-class harassing call samples, and to add new samples artificially synthesized from those harassing call samples to the raw data set;
a first selection module, configured to randomly select, for each minority-class harassing call sample, several first samples from its nearest neighbors;
a second selection module, configured to randomly select a second sample on the line connecting the harassing call sample and the first sample.
7. The harassing call identification system based on the random forest algorithm according to claim 5, characterized in that the first building module comprises:
a parameter setting module, configured to set the random forest parameters, wherein the random forest parameters include the number of decision trees, sampling with replacement, information gain, the features used for splitting when selecting the most suitable attribute, and the maximum depth of the trees;
a first computing module, configured to calculate the information gain of each attribute and select the most suitable node, to recompute the information gain at each child node and select the node with the maximum information gain, and so on, to generate multiple trees, wherein the information gain is calculated as follows:
G(D, A) = H(D) - H(D|A)
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy given the selected feature A;
a second building module, configured to construct the random forest according to the random forest parameters and the information gain values, and to train multiple decision trees using the random forest algorithm to generate the harassing call identification model.
8. The harassing call identification system based on the random forest algorithm according to any one of claims 5-7, characterized in that the system further comprises an evaluation module, configured to evaluate the harassing call identification model using the recognition-effect evaluation value, wherein the recognition-effect evaluation value includes precision, recall, and F1-score, calculated respectively as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 * Precision * Recall / (Precision + Recall)
where TP is the number of samples that are positive and predicted positive, FP is the number of samples that are negative but predicted positive, and FN is the number of samples that are positive but predicted negative.
CN201910339683.6A 2019-04-25 2019-04-25 Harassing call recognition methods and system based on random forests algorithm Pending CN110147430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910339683.6A CN110147430A (en) 2019-04-25 2019-04-25 Harassing call recognition methods and system based on random forests algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910339683.6A CN110147430A (en) 2019-04-25 2019-04-25 Harassing call recognition methods and system based on random forests algorithm

Publications (1)

Publication Number Publication Date
CN110147430A true CN110147430A (en) 2019-08-20

Family

ID=67594472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910339683.6A Pending CN110147430A (en) 2019-04-25 2019-04-25 Harassing call recognition methods and system based on random forests algorithm

Country Status (1)

Country Link
CN (1) CN110147430A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550583A (en) * 2015-12-22 2016-05-04 电子科技大学 Random forest classification method based detection method for malicious application in Android platform
CN106255116A (en) * 2016-08-24 2016-12-21 王瀚辰 A kind of recognition methods harassing number
CN106446566A (en) * 2016-09-29 2017-02-22 北京理工大学 Elderly cognitive function classification method based on random forest

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062422A (en) * 2019-11-29 2020-04-24 上海观安信息技术股份有限公司 Method and device for systematic identification of road loan
CN111062422B (en) * 2019-11-29 2023-07-14 上海观安信息技术股份有限公司 Method and device for identifying set-way loan system
CN114189585A (en) * 2020-09-14 2022-03-15 中国移动通信集团重庆有限公司 Crank call abnormity detection method and device and computing equipment
CN112364901A (en) * 2020-10-30 2021-02-12 上海欣方智能系统有限公司 LGB algorithm-based fraud call identification method
CN112464058A (en) * 2020-11-30 2021-03-09 上海欣方智能系统有限公司 XGboost algorithm-based telecommunication internet fraud identification method
CN112464058B (en) * 2020-11-30 2024-08-20 上海欣方智能系统有限公司 Telecommunication Internet fraud recognition method based on XGBoost algorithm
CN113163057A (en) * 2021-01-20 2021-07-23 北京工业大学 Method for constructing dynamic identification interval of fraud telephone
CN113163057B (en) * 2021-01-20 2022-09-30 北京工业大学 Method for constructing dynamic identification interval of fraud telephone
CN112866486A (en) * 2021-02-01 2021-05-28 西安交通大学 Multi-source feature-based fraud telephone identification method, system and equipment
CN112866486B (en) * 2021-02-01 2022-06-07 西安交通大学 Multi-source feature-based fraud telephone identification method, system and equipment
CN113837303A (en) * 2021-09-29 2021-12-24 中国联合网络通信集团有限公司 Black product user identification method, TEE node and computer readable storage medium
CN114979369A (en) * 2022-04-14 2022-08-30 马上消费金融股份有限公司 Abnormal call detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110147430A (en) Harassing call recognition methods and system based on random forests algorithm
CN103729678A (en) Navy detection method and system based on improved DBN model
CN106651424A (en) Electric power user figure establishment and analysis method based on big data technology
Wen et al. Multi-level deep cascade trees for conversion rate prediction in recommendation system
Anandalingam et al. A multi-stage multi-attribute decision model for project selection
CN107155010A (en) The methods, devices and systems of user speech calling are handled based on big data
CN106202031B (en) System and method for associating group members based on group chat data
CN112464058B (en) Telecommunication Internet fraud recognition method based on XGBoost algorithm
CN104462592A (en) Social network user behavior relation deduction system and method based on indefinite semantics
Zhu et al. Ensemble methodology: Innovations in credit default prediction using lightgbm, xgboost, and localensemble
CN104809635A (en) Dynamic internet comment analysis method
CN103488637B (en) A kind of method carrying out expert Finding based on dynamics community's excavation
CN110598129A (en) Cross-social network user identity recognition method based on two-stage information entropy
CN109960722A (en) A kind of information processing method and device
CN115456695A (en) Method, device, system and medium for analyzing shop address selection
CN109686402A (en) Based on key protein matter recognition methods in dynamic weighting interactive network
CN109949174A (en) A kind of isomery social network user entity anchor chain connects recognition methods
CN112116168A (en) User behavior prediction method and device and electronic equipment
CN107368499A (en) A kind of client's tag modeling and recommendation method and device
He et al. A study on prediction of customer churn in fixed communication network based on data mining
CN110992194A (en) User reference index algorithm based on attribute-containing multi-process sampling graph representation learning model
CN109670998A (en) Based on the multistage identification of accurate subsidy and system under the big data environment of campus
CN109543041A (en) A kind of generation method and device of language model scores
CN108717445A (en) A kind of online social platform user interest recommendation method based on historical data
CN107222319A (en) A kind of traffic operation analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20190820)