CN106255116A - Method for identifying harassing numbers - Google Patents
- Publication number
- CN106255116A CN201610710545.0A
- Authority
- CN
- China
- Prior art keywords
- harassing
- wrecking
- time
- random forest
- decision tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W12/00—Security arrangements; Authentication; Protecting privacy or anonymity
- H04W12/12—Detection or prevention of fraud
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/22—Arrangements for supervision, monitoring or testing
- H04M3/2281—Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Signal Processing (AREA)
- Technology Law (AREA)
- Computer Networks & Wireless Communication (AREA)
- Telephonic Communication Services (AREA)
Abstract
A method for identifying harassing numbers, comprising: selecting a number of confirmed harassing and non-harassing numbers and computing their communication behavior indicators over a period of time; forming a training sample set from these numbers and their indicators and building a random forest classification model, whose input is the communication behavior indicators of a subscriber number and whose output is the predicted probabilities with which all decision trees judge the number to be a harassing or a non-harassing number; and feeding the communication behavior indicators of a number to be identified over a period of time into the random forest classification model, computing the predicted probabilities with which all decision trees in the model judge it to be a harassing or a non-harassing number, and deciding on that basis whether the number to be identified is a harassing number. The invention belongs to the field of network communication technology; it makes full use of the call characteristics of calling and called numbers to identify harassing numbers effectively from the massive traffic data of a live network.
Description
Technical field
The present invention relates to a method for identifying harassing numbers and belongs to the field of network communication technology.
Background
Harassing calls, made to push advertising or fraudulent information, have become an illegal trade that disturbs public order. Comprehensive analysis shows that harassing calls generally have the following characteristics:
1. Dispersed callees: a harassing number calls many numbers per unit time at high frequency, and the called numbers are only weakly correlated with one another;
2. The harassing caller and its callees are usually only weakly related, i.e. they share little call history, and the number of calls a harassing number initiates as caller is usually far greater than the number it receives as callee;
3. Harassing calls are very short, and the callee is more likely to be the one who hangs up;
4. Harassing calls typically have a high call frequency and are concentrated in certain time periods.
Patent application CN200910079707.5 (title: Method and device for identifying harassing calls; filing date: 2009-03-06; applicant: ZTE Co., Ltd) discloses a method and device for identifying harassing calls. It introduces processing of unfamiliar telephone numbers into the mobile phone: statistics on the call interval, call duration and number of incoming calls of an unfamiliar calling number are automatically compared with the user's judgment rules to identify harassing calls. This scheme relies only on statistics of call interval, call duration and number of incoming calls; the judgment method is very simple and does not make full use of the call characteristics of calling and called numbers to identify harassing numbers effectively from the massive traffic data of a live network.
Therefore, how to make full use of the call characteristics of calling and called numbers to identify harassing numbers effectively from the massive traffic data of a live network remains a technical problem worth further study.
Summary of the invention
In view of this, the object of the present invention is to provide a method for identifying harassing numbers that makes full use of the call characteristics of calling and called numbers to identify harassing numbers effectively from the massive traffic data of a live network.
To achieve this object, the invention provides a method for identifying harassing numbers, comprising:
Step 1: select a number of confirmed harassing and non-harassing numbers, compute their communication behavior indicators over a period of time, then form a training sample set from these numbers and their indicators and build a random forest classification model. The input of the model is the communication behavior indicators of each subscriber number, and its output is the predicted probabilities with which all decision trees judge the number to be a harassing or a non-harassing number;
Step 2: feed the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, compute the predicted probabilities with which all decision trees in the model judge it to be a harassing or a non-harassing number, and decide on that basis whether the number to be identified is a harassing number.
Compared with the prior art, the beneficial effects of the invention are as follows. Communication behavior indicators such as callee dispersion, the callee relationship circle, the outgoing-to-incoming call ratio and the distribution of call times effectively capture the behavioral characteristics of harassing numbers. The invention uses a random forest classification model that takes as input multiple communication behavior indicators, such as call count, number of called numbers, call duration, ringing duration, active release count, passive release count, callee dispersion, the correlation coefficient among the called numbers of the same caller, the maximum count of calls to the same ten-thousand-number block, caller ratio and the standard deviation of call intervals, and identifies harassing numbers from the probabilities with which all decision trees judge a number to be harassing or non-harassing. It can therefore exploit the call characteristics of calling and called numbers, fully mine the data features in a large training sample, and effectively identify harassing numbers from the massive traffic data of a live network; the communication behavior indicators can also be adjusted flexibly as needed. Because harassing calls have a high call frequency and are concentrated in certain time periods, the invention further divides a whole day's call detail records into communication periods whose lengths are given by multiple time granularities, and computes the communication behavior indicators of a subscriber number from its high-frequency communication period under each granularity, which further improves the near-real-time performance and efficiency of harassing number identification. The invention can also build several random forest classification models and select the best one according to the recognition rate each achieves in testing.
Brief description of the drawings
Fig. 1 is a flow chart of the harassing number identification method of the present invention.
Fig. 2 is a detailed flow chart of step A.
Fig. 3 is a detailed flow chart of the generation of the k-th decision tree, k = 1, 2, ..., K, in step 11.
Fig. 4 is a detailed flow chart of step 2 of Fig. 1.
Fig. 5 is a detailed flow chart in which the invention builds a test sample set, tests multiple random forest classification models separately, and selects an optimal random forest classification model according to the test results.
Detailed description of the embodiments
To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings.
As shown in Fig. 1, the harassing number identification method of the invention comprises:
Step 1: select a number of confirmed harassing and non-harassing numbers, compute their communication behavior indicators over a period of time, then form a training sample set from these numbers and their indicators and build a random forest classification model. The input of the model is the communication behavior indicators of each subscriber number, and its output is the predicted probabilities with which all decision trees judge the number to be a harassing or a non-harassing number;
Step 2: feed the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, compute the predicted probabilities with which all decision trees in the model judge it to be a harassing or a non-harassing number, and decide on that basis whether the number to be identified is a harassing number.
In step 1, confirmed harassing and non-harassing numbers can be selected from blacklist and whitelist tables of historically confirmed numbers (for example, harassing numbers obtained from Internet companies or marked by an operator's report-and-complaint system). Call event signaling records are then collected by signaling acquisition from equipment such as a signaling monitoring system or the A interface, or historical call detail records are collected from the BOSS, to obtain the communication information of the selected numbers over a period of time. Communication records whose key fields are malformed or missing are discarded.
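The record-cleaning rule above can be sketched as follows. This is a minimal illustration only: the field names (`caller`, `callee`, `start_time`, `duration_s`) and the timestamp format are assumptions, since the source does not describe the call detail record schema.

```python
from datetime import datetime

# Hypothetical key fields; the source does not name the actual record schema.
KEY_FIELDS = ("caller", "callee", "start_time", "duration_s")


def is_valid(record):
    """Return True when every key field is present and well-formed."""
    for field in KEY_FIELDS:
        if record.get(field) in (None, ""):
            return False  # missing (vacant) key field
    if not (str(record["caller"]).isdigit() and str(record["callee"]).isdigit()):
        return False  # malformed number
    try:
        # Assumed timestamp format, for illustration only.
        datetime.strptime(record["start_time"], "%Y-%m-%d %H:%M:%S")
        return float(record["duration_s"]) >= 0
    except (TypeError, ValueError):
        return False  # malformed time or duration


def clean_records(records):
    """Keep only the records whose key fields are complete and well-formed."""
    return [r for r in records if is_valid(r)]
```

Filtering before any indicator is computed keeps malformed records from skewing counts such as the call count or the caller ratio.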
Because harassing calls have a high call frequency and are concentrated in certain time periods, and in order to further improve the near-real-time performance and efficiency of identification, the invention can also compute the communication behavior indicators of a subscriber number separately for different time granularities. The time granularity may take, without limitation, the values 1, 5, 15, 30, 60, 180, 360, 720 and 1440 minutes. The communication behavior indicators of a subscriber number over a period of time can then include the indicators under each time granularity, the subscriber number and a user identifier, where the user identifier marks whether the subscriber number is a harassing number (for example, numbers selected from the blacklist are marked as harassing and numbers selected from the whitelist as non-harassing). In the invention, computing the communication behavior indicators of a subscriber number (harassing and non-harassing numbers as well as numbers to be identified) over a period of time based on time granularity may further include:
Step A: collect the user's historical call detail records for several consecutive days, choose multiple time granularities, find the user's high-frequency call period under each time granularity, and finally compute the user's communication behavior indicators under each granularity from the communication behavior within the corresponding high-frequency call period.
As shown in Fig. 2, step A may further include:
Step A1: extract the user's call detail records one day at a time;
Step A2: read the start and end times of that day's records and compute the maximum time granularity Tmax that this span covers, i.e. pick from the chosen granularities the largest one that does not exceed the duration between the start and end times;
If the collected records are incomplete because of loss or other reasons, for example only the records between 12:00 and 24:00 were collected, the invention keeps only the time granularities that do not exceed the Tmax covered by the records' start and end times;
Step A3: extract each time granularity in turn and check whether it is less than or equal to Tmax. If so, divide the span between that day's start and end times into consecutive communication periods whose length is the extracted granularity, and count the user's calls in each communication period; these counts are the call counts of the extracted granularity for each communication period of that day. Then move on to the next granularity until all have been extracted. If not, skip to the next granularity until all have been extracted;
The extracted granularity must be less than or equal to Tmax. For example, when Tmax = 30 minutes, the granularities extracted are 1, 5, 15 and 30 minutes, and granularities greater than Tmax are not processed further. When a day's records run from 0:00 to 24:00 and the extracted granularity T is 30 minutes, the communication periods are 0:00-0:30, 0:31-1:00, ..., 23:01-23:30, 23:31-24:00; each period runs from second 00 of its first minute to second 59 of the minute before the next period begins;
Step A4: check whether the records of all days have been extracted. If so, continue to the next step; if not, extract the user's records for the next day and return to step A2;
Step A5: for each time granularity, select the maximum from the call counts of all communication periods over all days; the communication period with that maximum is the user's high-frequency communication period under that granularity, i.e. the period over those days in which the user's calls are most frequent and most concentrated;
Step A6: compute the user's communication behavior indicators under each time granularity, i.e. the indicators within the user's high-frequency call period under each granularity. The indicators may include, without limitation: call count, number of called numbers, call duration, ringing duration, active release count, passive release count, callee dispersion, the correlation coefficient among the called numbers of the same caller, the maximum count of calls to the same ten-thousand-number block, caller ratio (outgoing calls divided by total outgoing and incoming calls), and the standard deviation of call intervals (computed only when there are 3 or more called numbers). The correlation coefficient among the called numbers of the same caller is the ratio of the number of the user's called numbers that have call behavior with one another to the total number of numbers the user called. For example, user A called 100 numbers in the high-frequency call period of time granularity T, and within a specified period (such as the training period of days 1-5) four of those called numbers, B, C, D and E, had call behavior with one another (for example five calls with call duration >= 0: B->C, D->E, C->B, C->D, D->C); the correlation coefficient among the called numbers of caller A is then 4/100. The maximum count of calls to the same ten-thousand-number block is the largest number of called numbers that belong to the same ten-thousand-number block, where the block is what remains of a called number after its last 4 digits are removed. For example, if the blocks of the numbers the user called are 1395193, 1395193, 1390123 and 1390438, the numbers of called numbers per block are 2, 1 and 1, so the maximum count is 2.
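Steps A2 to A5 can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented implementation: call times are represented as minutes since midnight, each day is a hypothetical tuple `(start_min, end_min, call_minutes)`, periods are treated as half-open intervals, and Tmax is taken as the largest granularity not exceeding the covered span.

```python
from collections import Counter

# Granularity values listed in the description, in minutes.
GRANULARITIES_MIN = (1, 5, 15, 30, 60, 180, 360, 720, 1440)


def max_granularity(span_min, granularities=GRANULARITIES_MIN):
    """Step A2: largest granularity not exceeding the span the records cover."""
    fitting = [g for g in granularities if g <= span_min]
    return max(fitting) if fitting else None


def period_counts(call_minutes, start_min, end_min, granularity):
    """Step A3: count calls in each consecutive period of `granularity` minutes."""
    counts = Counter()
    for t in call_minutes:
        if start_min <= t < end_min:
            counts[(t - start_min) // granularity] += 1
    return counts


def high_frequency_periods(days, granularities=GRANULARITIES_MIN):
    """Steps A1-A5: per granularity, the (day, period index, call count) with most calls."""
    best = {}
    for day, (start_min, end_min, call_minutes) in days.items():
        t_max = max_granularity(end_min - start_min, granularities)
        if t_max is None:
            continue
        for g in granularities:
            if g > t_max:
                continue  # granularities above Tmax are skipped for this day
            counts = period_counts(call_minutes, start_min, end_min, g)
            for idx, n in counts.items():
                if g not in best or n > best[g][2]:
                    best[g] = (day, idx, n)
    return best
```

The indicators of step A6 would then be computed only from the calls that fall inside the selected period for each granularity.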
In step A6, if the user has a communication behavior indicators under each time granularity and b time granularities are chosen, the number of features (i.e. communication behavior indicators) of each user in the training sample set of the random forest classification model can be: M = a*b + 2.
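Two of the step A6 indicators lend themselves to a short worked sketch that reproduces the two examples given in the description (4/100 and 2). The function names are hypothetical, called numbers are assumed to be distinct strings, and the full 11-digit numbers in the usage below are made up to match the blocks quoted in the text.

```python
from collections import Counter


def callee_correlation(callees, calls_between):
    """Share of a caller's called numbers that have call behavior with one another."""
    callee_set = set(callees)
    linked = set()
    for src, dst in calls_between:
        # Only calls where both ends are called numbers of this caller count.
        if src in callee_set and dst in callee_set and src != dst:
            linked.update((src, dst))
    return len(linked) / len(callee_set) if callee_set else 0.0


def max_block_frequency(callees):
    """Largest number of called numbers sharing one ten-thousand-number block."""
    # The block is the called number with its last 4 digits removed.
    blocks = Counter(number[:-4] for number in callees)
    return max(blocks.values()) if blocks else 0
```

For instance, with 100 called numbers of which B, C, D and E called one another (B->C, D->E, C->B, C->D, D->C), `callee_correlation` gives 4/100; for called numbers whose blocks are 1395193, 1395193, 1390123 and 1390438, `max_block_frequency` gives 2.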
A random forest classification model can handle high-dimensional attribute data, needs no feature selection, trains quickly, can detect interactions between attributes during training, is easy to parallelize, and can output attribute importance, class probabilities and predicted classes. It is therefore chosen for identifying harassing numbers.
In the invention, the basic idea of the random forest classification model is as follows. First, K groups of samples are drawn from the original training set (N samples with M attributes) by random sampling with replacement, each group having the same size N as the original training set. Second, m attributes are chosen at random, with m less than or equal to the total attribute dimension M. Then a decision tree is generated from each group of N samples with m attributes, giving K decision tree models and K classification results. Finally, the K classification results vote on each record to determine its final class. Building the model therefore has two parts. One is the construction of the decision trees that form the forest; in the invention the decision trees can use the Gini impurity method, where the smaller the impurity, the more important the attribute. The other is the decision process, which outputs the best classification result by voting. Accordingly, in step 1, forming the training sample set from the harassing and non-harassing numbers and their communication behavior indicators and building the random forest classification model may further include:
Step 11: using the random forest classification model, randomly select m of the M communication behavior indicators of each training sample in the training sample set for training, producing K feature subsets and thus K decision trees; each decision tree yields its own predicted probability of the class to which an input number belongs. The decision trees can use the Gini impurity method: the smaller the impurity, the more important the feature.
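The two sources of randomness described above, sampling with replacement and random feature subsets, can be sketched with the standard library alone. The helper names are hypothetical, and the choice of m is left to the caller because the source's formula for m is not reproduced in this text.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement, as in bootstrap sampling."""
    return [rng.choice(samples) for _ in samples]


def feature_subsets(num_features, m, num_trees, rng):
    """Pick m distinct feature indices out of num_features for each of the K trees."""
    return [sorted(rng.sample(range(num_features), m)) for _ in range(num_trees)]


# Usage: one bootstrap sample and one feature subset per tree.
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
training_set = list(range(10))  # stand-in for N training samples
per_tree_samples = [bootstrap_sample(training_set, rng) for _ in range(3)]
per_tree_features = feature_subsets(num_features=20, m=4, num_trees=3, rng=rng)
```

Because every tree sees a different resampled set and a different feature subset, the trees disagree in useful ways, which is what makes the later aggregation over trees informative.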
As shown in Fig. 3, in step 11 the generation of the k-th decision tree, k = 1, 2, ..., K, may further include:
Step 111: use the bootstrap method to draw N samples with replacement from the training sample set S to form the sample set s(k) of the root node of the k-th decision tree, and set the minimum number of samples covered by a branch to d, where N is the number of samples in S and M is the feature dimension (i.e. the number of communication behavior indicators) of each training sample in S;
The sample set s(k) of the root node of every decision tree thus has the same number of samples as the training sample set S;
Step 112: randomly draw m of the M features as the indicator feature set of the k-th decision tree, where m may be computed by a formula that is not reproduced in this text;
Step 113: starting from the root node of the k-th decision tree, compute the Gini impurity I_G(j_k) of each of the m features according to the principle of minimum Gini impurity (the formula is not reproduced in this text). Here I_G(j_k) is the Gini impurity of the j-th feature of the k-th decision tree; Z1 and Z2 are the sample sets of the left and right nodes produced when the root node's sample set s(k) is split by the optimal binary partition on feature j; N(Z1) and N(Z2) are the sample sizes of Z1 and Z2; and Gini(Z1) and Gini(Z2) are their Gini impurities. Gini(Zl) (l = 1 or 2) is in turn computed from the probabilities, for class i = 0 or 1 (i = 0 denotes a non-harassing number and i = 1 a harassing number), that a number in the branch corresponding to Zl under the j-th feature in the k-th decision tree is identified as a harassing or a non-harassing number;
Step 114: choose the minimum among the Gini impurities of the m features and take the corresponding feature as the root node, then split the root node into a left node and a right node. With the root node as a constraint, continue to compute the Gini impurities of the m features at each of these nodes and select the feature with the minimum value as the node from which to keep growing, and so on, thereby forming the decision tree. If the number of samples covered by a branch is less than d, the current node of that branch is set as a leaf node, i.e. it stops growing, and training continues on the other nodes until all nodes have been trained or marked as leaf nodes.
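Since the impurity formulas are not reproduced in this text, the sketch below uses the standard textbook definitions as an assumption: Gini(Z) = 1 - sum of squared class probabilities, and a split is scored by the sample-size-weighted average of the impurities of its two children. The threshold search over observed feature values is likewise an illustrative choice, and all function names are hypothetical.

```python
def gini(labels):
    """Gini impurity for binary labels (0 = non-harassing, 1 = harassing)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_impurity(left, right):
    """Weighted Gini impurity of a binary split (assumed standard form)."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_split(rows, labels, feature_ids):
    """Return (impurity, feature, threshold) with the smallest weighted impurity."""
    best = None
    for j in feature_ids:
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty children
            score = split_impurity(left, right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best
```

A full tree would apply `best_split` recursively to each child and stop once a branch covers fewer than d samples, as step 114 describes.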
As shown in Fig. 4, step 2 of Fig. 1 may further include:
Step 21: feed the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, and compute, for each leaf node of each decision tree, the predicted probability of the class of the number to be identified (the formula is not reproduced in this text). Here i = 0 or 1, with i = 0 denoting a non-harassing number and i = 1 a harassing number; the probability is that with which the r-th leaf node of the k-th decision tree predicts that the number to be identified belongs to class i, computed from the number of numbers of class i in the r-th leaf node of the k-th decision tree and the total number of numbers that leaf node contains;
Step 22: compute the predicted probability with which each decision tree judges the class of the number to be identified (the formula is not reproduced in this text), where P_k(i) is the predicted probability with which the k-th decision tree judges that the number to be identified belongs to class i, and R_k is the total number of leaf nodes of the k-th decision tree;
Step 23: compute the sum over all decision trees of the predicted probabilities for the class of the number to be identified (the formula is not reproduced in this text), where w(i) is the sum of the predicted probabilities with which all decision trees judge that the number belongs to class i. The larger value is then selected, and the class of the number to be identified is the class with the larger value: when w(0) > w(1), the class is 0 (a non-harassing number), and when w(0) < w(1), the class is 1 (a harassing number).
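A minimal sketch of steps 21 and 23 follows. It rests on assumptions that should be read as such: the leaf probability is taken as the fraction of class-i numbers in a leaf, each tree's probability pair is supplied directly (the per-tree aggregation formula of step 22 is not reproduced in this text), and a tie w(0) = w(1), which the source does not address, is resolved as non-harassing.

```python
def leaf_probabilities(n_non_harassing, n_harassing):
    """Class fractions in one leaf: (P(i=0), P(i=1)). Assumed form of step 21."""
    total = n_non_harassing + n_harassing
    return (n_non_harassing / total, n_harassing / total)


def classify(tree_probs):
    """Sum per-tree probabilities and compare, as in step 23.

    tree_probs is a list of (P_k(0), P_k(1)) pairs, one per decision tree.
    Returns (label, w0, w1), where label 1 means harassing number.
    """
    w0 = sum(p[0] for p in tree_probs)
    w1 = sum(p[1] for p in tree_probs)
    label = 1 if w1 > w0 else 0  # ties fall back to 0; an assumption
    return label, w0, w1
```

Summing probabilities rather than counting hard votes lets a confident tree weigh more than a tree that is close to undecided.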
The random forest classification model randomly generates K decision trees, each with multiple leaf nodes. For each input subscriber number, some leaf nodes judge it to be a harassing number and others a non-harassing number. From the predicted probabilities of all leaf nodes, one obtains the probabilities with which each decision tree judges the number to be harassing or non-harassing, and for each tree these two probabilities sum to 1. For example, the 1st decision tree judges the number to be harassing with probability 5/6 and non-harassing with probability 1/6; the 2nd decision tree judges it to be harassing with probability 2/7 and non-harassing with probability 5/7; ...; the K-th decision tree judges it to be harassing with probability 3/5 and non-harassing with probability 2/5. The K decision trees then give a summed probability of harassing (5/6 + 2/7 + ... + 3/5) and a summed probability of non-harassing (1/6 + 5/7 + ... + 2/5). If the summed probability of harassing is greater than that of non-harassing, the input number is a harassing number; otherwise it is a non-harassing number.
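The arithmetic of this example can be checked with exact fractions. The sketch assumes, purely for illustration, that the forest consists of only the three trees whose probabilities are quoted (K = 3); the trees elided by "..." in the text are not filled in.

```python
from fractions import Fraction

# (non-harassing, harassing) probabilities of the three trees quoted in the text.
tree_probs = [
    (Fraction(1, 6), Fraction(5, 6)),
    (Fraction(5, 7), Fraction(2, 7)),
    (Fraction(2, 5), Fraction(3, 5)),
]

w_non_harassing = sum(p[0] for p in tree_probs)  # 1/6 + 5/7 + 2/5
w_harassing = sum(p[1] for p in tree_probs)      # 5/6 + 2/7 + 3/5
is_harassing = w_harassing > w_non_harassing

print(w_harassing, w_non_harassing, is_harassing)  # 361/210 269/210 True
```

Under that three-tree assumption the harassing sum (361/210) exceeds the non-harassing sum (269/210), so the number would be classified as a harassing number.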
When building a random forest classification model, the settings for the number of decision trees, the feature dimension (i.e. the number of communication behavior indicators) and the depth of the decision trees all affect recognition performance. To improve recognition further, the invention can also build multiple random forest classification models (with different values for the number of decision trees, the feature dimension and the tree depth), build a test sample set, test the models separately, and select the best one according to the test results. As shown in Fig. 5, the invention may further include:
Step B1: extract the communication behavior indicators of each test sample in the test sample set one by one and feed all extracted indicators into each random forest classification model, obtaining each model's judgment of whether the test sample is a harassing number;
Step B2: match the harassing numbers identified by each random forest classification model against confirmed harassing numbers (for example, harassing numbers marked by Internet companies or by an operator's report-and-complaint system) and compute the precision and recall of each model;
Step B3: from the precision and recall, compute the recognition rate F of each random forest classification model (the formula is not reproduced in this text), where Precision is the precision and Recall is the recall, and select the maximum F among all models; the model with the maximum F is the optimal random forest classification model.
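Steps B2 and B3 can be sketched as below. Because the source's formula for F is not reproduced in this text, the sketch assumes the common F1 form, F = 2 * Precision * Recall / (Precision + Recall); the function names and the model names in the usage note are hypothetical.

```python
def precision_recall(predicted, confirmed):
    """Precision and recall of predicted harassing numbers against confirmed ones."""
    predicted, confirmed = set(predicted), set(confirmed)
    hits = len(predicted & confirmed)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(confirmed) if confirmed else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Assumed F1 form of the recognition rate F."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def select_best(model_predictions, confirmed):
    """Return (name of the model with the largest F, all F scores)."""
    scores = {
        name: f_measure(*precision_recall(pred, confirmed))
        for name, pred in model_predictions.items()
    }
    return max(scores, key=scores.get), scores
```

For example, with confirmed numbers {a, b, c, d}, a model that flags {a, b, x, y} scores F = 0.5, while one that flags {a, b, c} scores higher and would be selected.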
The foregoing is only a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. A method for identifying harassing numbers, characterized in that it comprises:
Step 1: select a number of confirmed harassing and non-harassing numbers, compute their communication behavior indicators over a period of time, then form a training sample set from these numbers and their indicators and build a random forest classification model, the input of the model being the communication behavior indicators of each subscriber number and its output being the predicted probabilities with which all decision trees judge the number to be a harassing or a non-harassing number;
Step 2: feed the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, compute the predicted probabilities with which all decision trees in the model judge it to be a harassing or a non-harassing number, and decide on that basis whether the number to be identified is a harassing number.
2. The method according to claim 1, characterized in that computing the communication behavior indicators of a subscriber number over a period of time further includes:
Step A: collect the user's historical call detail records for several consecutive days, choose multiple time granularities, find the user's high-frequency call period under each time granularity, and finally compute the user's communication behavior indicators under each granularity from the communication behavior within the corresponding high-frequency call period; the communication behavior indicators of a subscriber number over a period of time include, without limitation: the user's communication behavior indicators under each time granularity, the subscriber number and a user identifier.
3. The method according to claim 2, characterized in that the time granularity takes, without limitation, the values 1, 5, 15, 30, 60, 180, 360, 720 and 1440 minutes.
4. The method according to claim 2, characterized in that step A further includes:
Step A1: extract the user's call detail records one day at a time;
Step A2: read the start and end times of that day's call detail records, and compute the maximum time granularity Tmax covered by those start and end times, i.e. pick, from the chosen time granularities, the largest one whose value is less than the duration between the start and end times;
Step A3: extract the time granularities one by one and judge whether the extracted time granularity is less than or equal to Tmax; if so, divide the duration between the start and end times of that day's call detail records into multiple consecutive communication periods whose length is the extracted time granularity, then compute the user's call frequency in each communication period, which is the call frequency of the extracted time granularity in each communication period of that day, and continue with the next time granularity until all time granularities have been extracted; if not, continue with the next time granularity until all time granularities have been extracted;
Step A4: judge whether the call detail records of all days have been extracted; if so, continue to the next step; if not, extract the user's call detail records for the next day and return to step A2;
Step A5: for each time granularity, select the maximum among the call frequencies of all communication periods over all days; the communication period corresponding to that maximum is the user's high-frequency call period under that time granularity;
Step A6: compute the user's communication behavior indices under each time granularity, i.e. the communication behavior indices within the user's high-frequency call period under each time granularity.
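Steps A1 to A5 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the call detail records have already been reduced to a list of call start times for one user, that a day's start and end times are its earliest and latest call, and that ties are broken in favor of the earliest window. The names `high_frequency_periods` and `GRANULARITIES_MIN` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Time granularities in minutes, as listed in claim 3.
GRANULARITIES_MIN = (1, 5, 15, 30, 60, 180, 360, 720, 1440)

def high_frequency_periods(call_times, granularities=GRANULARITIES_MIN):
    """Return {granularity: (window_start, call_count)} for one user."""
    by_day = defaultdict(list)
    for t in call_times:                       # group records by day (A1, A4)
        by_day[t.date()].append(t)

    best = {}
    for day_calls in by_day.values():
        start, end = min(day_calls), max(day_calls)        # step A2
        span_min = (end - start).total_seconds() / 60
        usable = [g for g in granularities if g < span_min]
        if not usable:
            continue                           # span too short for any granularity
        t_max = max(usable)
        for g in granularities:                # step A3
            if g > t_max:
                continue
            # Index of the g-minute window each call falls into.
            counts = Counter(
                int((t - start).total_seconds() // (g * 60)) for t in day_calls
            )
            idx, n = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
            window_start = start + timedelta(minutes=g * idx)
            if g not in best or n > best[g][1]:            # step A5
                best[g] = (window_start, n)
    return best

calls = [
    datetime(2016, 8, 1, 9, 0),
    datetime(2016, 8, 1, 9, 1),
    datetime(2016, 8, 1, 9, 2),
    datetime(2016, 8, 1, 9, 40),
    datetime(2016, 8, 1, 12, 0),
]
periods = high_frequency_periods(calls)
```

Here the day spans exactly 180 minutes, so the 180-minute granularity is excluded by the strict "less than" rule of step A2 and only granularities up to 60 minutes receive a high-frequency period.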
5. The method according to claim 1, characterized in that the communication behavior indices include but are not limited to: calling frequency, number of called parties, call duration, ring duration, number of active releases, number of passive releases, called-party dispersion, correlation coefficient among the called numbers of the same caller, maximum frequency of calls to the same ten-thousand number segment, caller ratio, and standard deviation of call time intervals, wherein:
the correlation coefficient among the called numbers of the same caller is the ratio of the number of called numbers, among all the called numbers the user has called, that have call behavior with one another to the total number of called numbers the user has called; the maximum frequency of calls to the same ten-thousand number segment is the maximum count of called numbers that the user has called and that belong to the same ten-thousand number segment, where the ten-thousand number segment is what remains of a called number after its last 4 digits are removed.
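The two derived indices defined in this claim can be illustrated with a short sketch. It assumes the call records are available as `(caller, callee)` pairs of number strings and that the ten-thousand-segment frequency counts distinct called numbers; both are assumptions, and the function names are hypothetical.

```python
from collections import Counter

def callee_correlation(caller, calls):
    """Share of the caller's called numbers that also call one another.

    calls: iterable of (caller, callee) pairs.
    """
    callees = {b for a, b in calls if a == caller}
    if not callees:
        return 0.0
    linked = set()
    for a, b in calls:
        # A call between two different numbers that the caller both dialed.
        if a in callees and b in callees and a != b:
            linked.update((a, b))
    return len(linked) / len(callees)

def max_segment_frequency(caller, calls):
    """Largest count of distinct called numbers sharing one ten-thousand segment."""
    callees = {b for a, b in calls if a == caller}
    # The segment is the number with its last 4 digits removed.
    segments = Counter(number[:-4] for number in callees)
    return max(segments.values(), default=0)

records = [
    ("A", "13800001111"),
    ("A", "13800002222"),
    ("A", "13900003333"),
    ("13800001111", "13800002222"),
]
correlation = callee_correlation("A", records)
segment_max = max_segment_frequency("A", records)
```

In this toy data two of the three called numbers have called each other, and two of them share the segment `1380000`.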
6. The method according to claim 1, characterized in that, in step one, forming the training sample set from the harassing and non-harassing numbers and their communication behavior indices and building the random forest classification model further includes:
Step 11: using the random forest classification model, randomly select m communication behavior indices from the M communication behavior indices of each training sample in the training sample set for training, so as to produce K feature subsets and thereby generate K decision trees, each decision tree providing its own prediction probability for the class of an input number, wherein the decision trees are built using the Gini impurity method.
7. The method according to claim 6, characterized in that, in step 11, the generation of the kth decision tree further includes:
Step 111: using the bootstrap method, draw N samples with replacement from the training sample set S to form the sample set s(k) of the root node of the kth decision tree, and set the minimum number of samples covered by a branch to d, where N is the number of samples in the training sample set S and M is the feature dimension of each training sample in S;
Step 112: randomly draw m of the M feature dimensions as the index feature set of the kth decision tree;
Step 113: starting from the root node of the kth decision tree, and following the principle of minimum Gini impurity, compute the Gini impurity of each of the m features: [formula missing in source], where $I_G(j_k)$ is the Gini impurity of the jth feature of the kth decision tree, $Z_1$ is the sample set of the left node produced by the optimal binary split of the root-node sample set s(k) on feature j, $Z_2$ is the sample set of the right node produced by that split, $N(Z_1)$ and $N(Z_2)$ are the sample counts of $Z_1$ and $Z_2$ respectively, and $Gini(Z_1)$ and $Gini(Z_2)$ are the Gini impurities of $Z_1$ and $Z_2$ respectively;
Step 114: choose the minimum among the Gini impurities of the m features and use the corresponding feature as the root node, then split this root node into a left node and a right node; next, with the root node as the constraint, continue computing the Gini impurity of the m features under that root node and select the feature corresponding to the minimum value as the root node for continued growth, and so on, thereby forming the decision tree; if the number of samples covered by a branch is less than d, set the current node of that branch as a leaf node, i.e. stop growing that node, and continue training the other nodes until every node has been trained or marked as a leaf node.
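The root-node part of steps 111 to 114 can be sketched as below. This is an illustration under stated assumptions, not the claimed procedure: the split impurity is taken to be the usual sample-weighted sum of the two child Gini impurities (the claim's own formula is not reproduced in this text), candidate thresholds are the observed feature values, and the stopping rule with minimum branch size d is omitted. All function names are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = harassing number)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def split_impurity(left, right):
    """Assumed form: child impurities weighted by child sample counts."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(X, y, feature_ids):
    """Return (impurity, feature, threshold) of the lowest-impurity split.

    Returns None when every candidate feature is constant.
    """
    best = None
    for j in feature_ids:
        # Every observed value except the largest is a usable threshold.
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [lab for row, lab in zip(X, y) if row[j] <= thr]
            right = [lab for row, lab in zip(X, y) if row[j] > thr]
            cand = (split_impurity(left, right), j, thr)
            if best is None or cand < best:
                best = cand
    return best

def root_split(X, y, m, rng):
    """Bootstrap the samples, draw m features, and split the root node."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]       # step 111: with replacement
    Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
    features = rng.sample(range(len(X[0])), m)       # step 112: m of M features
    return best_split(Xb, yb, features)              # steps 113-114 at the root

X = [[1, 10], [2, 10], [8, 10], [9, 10]]
y = [0, 0, 1, 1]
full_data_split = best_split(X, y, [0, 1])
bootstrapped = root_split(X, y, 1, random.Random(0))
```

On the full toy data the first feature separates the two classes perfectly at threshold 2, while the second feature is constant and offers no split.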
8. The method according to claim 7, characterized in that the formula for computing m is: [formula missing in source]; the formula for computing $Gini(Z_l)$ is: [formula missing in source], where l = 1 or 2 and i = 0 or 1; i = 0 denotes a non-harassing number and i = 1 denotes a harassing number; [symbol missing in source] is the probability that a number in the branch corresponding to $Z_l$, under the jth feature of the kth decision tree, is identified as a harassing number or a non-harassing number.
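The formulas referred to in claims 7 and 8 are not reproduced in this text. As a hedged reading only, the symbol definitions given there are consistent with the standard CART forms below, where $p_l(i)$ is a hypothetical symbol for the share of class i in $Z_l$. The formula for m cannot be recovered from the text; common choices in random forest practice include $\sqrt{M}$ and $\lfloor \log_2 M \rfloor + 1$, and nothing here indicates which, if either, the claim uses.

```latex
% Assumed standard forms; not taken from the source text.
I_G(j_k) = \frac{N(Z_1)}{N(Z_1)+N(Z_2)}\,Gini(Z_1)
         + \frac{N(Z_2)}{N(Z_1)+N(Z_2)}\,Gini(Z_2)

Gini(Z_l) = 1 - \sum_{i=0}^{1} p_l(i)^2, \qquad l \in \{1, 2\}
```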
9. The method according to claim 1, characterized in that step 2 further includes:
Step 21: input the communication behavior indices of the number to be identified over a period of time into the random forest classification model, and compute the prediction probability that each leaf node of each decision tree assigns to the class of the number to be identified: [formula missing in source], where i = 0 or 1, i = 0 denotes a non-harassing number and i = 1 denotes a harassing number; the three quantities in the formula, whose symbols are missing in the source, are respectively the prediction probability that the rth leaf node of the kth decision tree assigns to the number to be identified belonging to class i, the count of numbers belonging to class i in the rth leaf node of the kth decision tree, and the total count of numbers contained in the rth leaf node of the kth decision tree;
Step 22: compute the prediction probability with which each decision tree judges the class of the number to be identified: [formula missing in source], where $P_k(i)$ is the prediction probability with which the kth decision tree judges that the number to be identified belongs to class i, and $R_k$ is the total number of leaf nodes of the kth decision tree;
Step 23: compute the sum of the prediction probabilities with which all decision trees judge the class of the number to be identified: $w(i)=\sum_{k=1}^{K}P_k(i)$, where $w(i)$ is the sum of the prediction probabilities with which all K decision trees judge that the number to be identified belongs to class i; then select the maximum among them, and the class of the number to be identified is the class corresponding to that maximum.
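The aggregation in steps 21 to 23 can be sketched as follows. Because the per-tree formula of step 22 is not reproduced in this text, the sketch assumes that each tree's probability is simply the class share in the leaf that the number reaches; that is an assumption, not the claimed formula. The function names and the leaf counts are hypothetical.

```python
def leaf_probability(class_counts):
    """Class shares in one leaf.

    class_counts: (non_harassing, harassing) training counts in the leaf.
    """
    total = sum(class_counts)
    return [count / total for count in class_counts]

def forest_vote(leaf_counts_per_tree):
    """Sum per-tree probabilities into w(0), w(1) and pick the larger class."""
    w = [0.0, 0.0]
    for counts in leaf_counts_per_tree:
        p = leaf_probability(counts)
        w[0] += p[0]
        w[1] += p[1]
    # On a tie this returns class 0 (non-harassing), the first maximum.
    label = max((0, 1), key=lambda i: w[i])
    return w, label

# Leaves reached by one number in three trees (illustrative counts only).
reached_leaves = [(8, 2), (1, 9), (3, 7)]
w, label = forest_vote(reached_leaves)
```

With these counts the summed probability for class 1 exceeds that for class 0, so the number would be labeled as harassing.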
10. The method according to claim 1, characterized in that it further includes:
Step B1: extract the communication behavior indices of each test sample from the test sample set one by one, and input all of the extracted communication behavior indices into each random forest classification model, thereby obtaining each random forest classification model's determination of whether the test sample is a harassing number;
Step B2: match the harassing numbers identified by each random forest classification model against the confirmed harassing numbers, and compute the precision and recall of each random forest classification model;
Step B3: from the precision and recall, compute the recognition rate F of each random forest classification model: [formula missing in source], where Precision is the precision and Recall is the recall; then select the maximum F from the recognition rates of all random forest classification models, and the random forest classification model corresponding to that maximum is the optimal random forest classification model.
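Steps B2 and B3 can be sketched as below. The claim's formula for F is not reproduced in this text; the sketch assumes the balanced F-measure, 2 * Precision * Recall / (Precision + Recall), which is a common way to combine the two but is an assumption here. The function names and model names are hypothetical.

```python
def precision_recall_f(predicted, confirmed):
    """Precision, recall and assumed balanced F-measure for one model."""
    predicted, confirmed = set(predicted), set(confirmed)
    true_positives = len(predicted & confirmed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def pick_best_model(results, confirmed):
    """results: {model_name: numbers the model flagged as harassing}."""
    return max(results, key=lambda name: precision_recall_f(results[name], confirmed)[2])

confirmed_harassing = {"a", "b", "c", "d"}
flagged = {
    "model_1": {"a", "b", "x"},
    "model_2": {"a", "b", "c", "w", "y", "z"},
}
best_model = pick_best_model(flagged, confirmed_harassing)
```

In this toy comparison the second model has lower precision but higher recall, and its F value is the larger of the two.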
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610710545.0A CN106255116A (en) | 2016-08-24 | 2016-08-24 | A kind of recognition methods harassing number |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610710545.0A CN106255116A (en) | 2016-08-24 | 2016-08-24 | A kind of recognition methods harassing number |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106255116A true CN106255116A (en) | 2016-12-21 |
Family
ID=57594647
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610710545.0A Pending CN106255116A (en) | 2016-08-24 | 2016-08-24 | A kind of recognition methods harassing number |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106255116A (en) |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106779868A (en) * | 2016-12-30 | 2017-05-31 | 中国民航信息网络股份有限公司 | Big customer's labeling method and device |
CN106982284A (en) * | 2017-04-12 | 2017-07-25 | 北京奇虎科技有限公司 | The recognition methods of harassing call number and device |
CN107133265A (en) * | 2017-03-31 | 2017-09-05 | 咪咕动漫有限公司 | A kind of method and device of identification behavior abnormal user |
CN107273531A (en) * | 2017-06-28 | 2017-10-20 | 百度在线网络技术(北京)有限公司 | Telephone number classifying identification method, device, equipment and storage medium |
CN107506776A (en) * | 2017-01-16 | 2017-12-22 | 恒安嘉新(北京)科技股份公司 | A kind of analysis method of fraudulent call number |
CN107733900A (en) * | 2017-10-23 | 2018-02-23 | 中国人民解放军信息工程大学 | One kind communication network users abnormal call behavioral value method for early warning |
CN108198086A (en) * | 2018-01-31 | 2018-06-22 | 海南海航信息技术有限公司 | For identifying the method and apparatus in harassing and wrecking source according to communication behavior feature |
CN108256542A (en) * | 2016-12-29 | 2018-07-06 | 北京搜狗科技发展有限公司 | A kind of feature of communication identifier determines method, apparatus and equipment |
CN108810230A (en) * | 2017-04-26 | 2018-11-13 | 腾讯科技(深圳)有限公司 | A kind of method, apparatus and equipment obtaining incoming call prompting information |
CN108989581A (en) * | 2018-09-21 | 2018-12-11 | 中国银行股份有限公司 | A kind of consumer's risk recognition methods, apparatus and system |
CN109241418A (en) * | 2018-08-22 | 2019-01-18 | 中国平安人寿保险股份有限公司 | Abnormal user recognition methods and device, equipment, medium based on random forest |
CN109429230A (en) * | 2017-08-28 | 2019-03-05 | 中国移动通信集团浙江有限公司 | A kind of communication swindle recognition methods and system |
CN109474756A (en) * | 2018-11-16 | 2019-03-15 | 国家计算机网络与信息安全管理中心 | A kind of telecommunications method for detecting abnormality indicating study based on contract network |
CN109525739A (en) * | 2018-12-25 | 2019-03-26 | 亚信科技(中国)有限公司 | A kind of telephone number recognition methods, device and server |
CN109547393A (en) * | 2017-09-21 | 2019-03-29 | 腾讯科技(深圳)有限公司 | Malice number identification method, device, equipment and storage medium |
CN109587357A (en) * | 2018-11-14 | 2019-04-05 | 上海麦图信息科技有限公司 | A kind of recognition methods of harassing call |
CN109688275A (en) * | 2018-12-27 | 2019-04-26 | 中国联合网络通信集团有限公司 | Harassing call recognition methods, device and storage medium |
CN109995924A (en) * | 2017-12-30 | 2019-07-09 | 中国移动通信集团贵州有限公司 | Cheat phone recognition methods, device, equipment and medium |
CN110147430A (en) * | 2019-04-25 | 2019-08-20 | 上海欣方智能系统有限公司 | Harassing call recognition methods and system based on random forests algorithm |
CN110177179A (en) * | 2019-05-16 | 2019-08-27 | 国家计算机网络与信息安全管理中心 | A kind of swindle number identification method based on figure insertion |
CN110275956A (en) * | 2019-06-24 | 2019-09-24 | 成都数之联科技有限公司 | A kind of personal identification method and system |
CN110351731A (en) * | 2018-04-08 | 2019-10-18 | 中兴通讯股份有限公司 | A kind of method and device of phone number antifraud |
CN110414543A (en) * | 2018-04-28 | 2019-11-05 | 中国移动通信集团有限公司 | A kind of method of discrimination, equipment and the computer storage medium of telephone number danger level |
CN110505353A (en) * | 2019-08-30 | 2019-11-26 | 北京泰迪熊移动科技有限公司 | A kind of number identification method, equipment and computer storage medium |
CN111062422A (en) * | 2019-11-29 | 2020-04-24 | 上海观安信息技术股份有限公司 | Method and device for systematic identification of road loan |
CN111104521A (en) * | 2019-12-18 | 2020-05-05 | 上海观安信息技术股份有限公司 | Anti-fraud detection method and detection system based on graph analysis |
CN111126434A (en) * | 2019-11-19 | 2020-05-08 | 山东省科学院激光研究所 | Automatic microseism first arrival time picking method and system based on random forest |
CN111432080A (en) * | 2018-12-24 | 2020-07-17 | 北京奇虎科技有限公司 | Ticket data processing method, electronic equipment and computer readable storage medium |
CN111885270A (en) * | 2020-07-09 | 2020-11-03 | 恒安嘉新(北京)科技股份公司 | Abnormal communication detection method, device, equipment and storage medium |
CN111918226A (en) * | 2020-07-23 | 2020-11-10 | 广州市申迪计算机系统有限公司 | Real-time signaling-based method and device for analyzing international high-settlement embezzlement behavior |
CN113709747A (en) * | 2020-05-09 | 2021-11-26 | 中国移动通信集团有限公司 | Harassment number identification method and device, computer equipment and storage medium |
CN113946720A (en) * | 2020-07-17 | 2022-01-18 | 中国移动通信集团广东有限公司 | Method and device for identifying users in group and electronic equipment |
CN114449106A (en) * | 2022-02-10 | 2022-05-06 | 恒安嘉新(北京)科技股份公司 | Abnormal telephone number identification method, device, equipment and storage medium |
CN114979369A (en) * | 2022-04-14 | 2022-08-30 | 马上消费金融股份有限公司 | Abnormal call detection method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1805428A (en) * | 2006-01-24 | 2006-07-19 | 陈永霞 | Number sorted communication network technique |
CN104023109A (en) * | 2014-06-27 | 2014-09-03 | 深圳市中兴移动通信有限公司 | Incoming call prompt method and device as well as incoming call classifying method and device |
WO2015062209A1 (en) * | 2013-10-29 | 2015-05-07 | 华为技术有限公司 | Visualized optimization processing method and device for random forest classification model |
CN104717674A (en) * | 2014-12-02 | 2015-06-17 | 北京奇虎科技有限公司 | Number attribute recognition method and device, terminal and server |
CN105718490A (en) * | 2014-12-04 | 2016-06-29 | 阿里巴巴集团控股有限公司 | Method and device for updating classifying model |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1805428A (en) * | 2006-01-24 | 2006-07-19 | 陈永霞 | Number sorted communication network technique |
WO2015062209A1 (en) * | 2013-10-29 | 2015-05-07 | 华为技术有限公司 | Visualized optimization processing method and device for random forest classification model |
CN104023109A (en) * | 2014-06-27 | 2014-09-03 | 深圳市中兴移动通信有限公司 | Incoming call prompt method and device as well as incoming call classifying method and device |
CN104717674A (en) * | 2014-12-02 | 2015-06-17 | 北京奇虎科技有限公司 | Number attribute recognition method and device, terminal and server |
CN105718490A (en) * | 2014-12-04 | 2016-06-29 | 阿里巴巴集团控股有限公司 | Method and device for updating classifying model |
Cited By (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256542A (en) * | 2016-12-29 | 2018-07-06 | 北京搜狗科技发展有限公司 | A kind of feature of communication identifier determines method, apparatus and equipment |
CN106779868A (en) * | 2016-12-30 | 2017-05-31 | 中国民航信息网络股份有限公司 | Big customer's labeling method and device |
CN107506776A (en) * | 2017-01-16 | 2017-12-22 | 恒安嘉新(北京)科技股份公司 | A kind of analysis method of fraudulent call number |
CN107133265A (en) * | 2017-03-31 | 2017-09-05 | 咪咕动漫有限公司 | A kind of method and device of identification behavior abnormal user |
CN107133265B (en) * | 2017-03-31 | 2021-07-09 | 咪咕动漫有限公司 | Method and device for identifying user with abnormal behavior |
CN106982284A (en) * | 2017-04-12 | 2017-07-25 | 北京奇虎科技有限公司 | The recognition methods of harassing call number and device |
CN108810230A (en) * | 2017-04-26 | 2018-11-13 | 腾讯科技(深圳)有限公司 | A kind of method, apparatus and equipment obtaining incoming call prompting information |
CN107273531A (en) * | 2017-06-28 | 2017-10-20 | 百度在线网络技术(北京)有限公司 | Telephone number classifying identification method, device, equipment and storage medium |
CN107273531B (en) * | 2017-06-28 | 2021-01-08 | 百度在线网络技术(北京)有限公司 | Telephone number classification identification method, device, equipment and storage medium |
CN109429230B (en) * | 2017-08-28 | 2022-01-25 | 中国移动通信集团浙江有限公司 | Communication fraud identification method and system |
CN109429230A (en) * | 2017-08-28 | 2019-03-05 | 中国移动通信集团浙江有限公司 | A kind of communication swindle recognition methods and system |
CN109547393A (en) * | 2017-09-21 | 2019-03-29 | 腾讯科技(深圳)有限公司 | Malice number identification method, device, equipment and storage medium |
CN107733900A (en) * | 2017-10-23 | 2018-02-23 | 中国人民解放军信息工程大学 | One kind communication network users abnormal call behavioral value method for early warning |
CN107733900B (en) * | 2017-10-23 | 2019-10-29 | 中国人民解放军信息工程大学 | A kind of communication network users abnormal call behavioral value method for early warning |
CN109995924A (en) * | 2017-12-30 | 2019-07-09 | 中国移动通信集团贵州有限公司 | Cheat phone recognition methods, device, equipment and medium |
CN108198086B (en) * | 2018-01-31 | 2021-06-25 | 海南海航信息技术有限公司 | Method and device for identifying disturbance source according to communication behavior characteristics |
CN108198086A (en) * | 2018-01-31 | 2018-06-22 | 海南海航信息技术有限公司 | For identifying the method and apparatus in harassing and wrecking source according to communication behavior feature |
CN110351731A (en) * | 2018-04-08 | 2019-10-18 | 中兴通讯股份有限公司 | A kind of method and device of phone number antifraud |
CN110414543A (en) * | 2018-04-28 | 2019-11-05 | 中国移动通信集团有限公司 | A kind of method of discrimination, equipment and the computer storage medium of telephone number danger level |
CN109241418B (en) * | 2018-08-22 | 2024-04-09 | 中国平安人寿保险股份有限公司 | Abnormal user identification method and device based on random forest, equipment and medium |
CN109241418A (en) * | 2018-08-22 | 2019-01-18 | 中国平安人寿保险股份有限公司 | Abnormal user recognition methods and device, equipment, medium based on random forest |
CN108989581B (en) * | 2018-09-21 | 2022-03-22 | 中国银行股份有限公司 | User risk identification method, device and system |
CN108989581A (en) * | 2018-09-21 | 2018-12-11 | 中国银行股份有限公司 | A kind of consumer's risk recognition methods, apparatus and system |
CN109587357A (en) * | 2018-11-14 | 2019-04-05 | 上海麦图信息科技有限公司 | A kind of recognition methods of harassing call |
CN109587357B (en) * | 2018-11-14 | 2021-04-06 | 上海麦图信息科技有限公司 | Crank call identification method |
CN109474756A (en) * | 2018-11-16 | 2019-03-15 | 国家计算机网络与信息安全管理中心 | A kind of telecommunications method for detecting abnormality indicating study based on contract network |
CN109474756B (en) * | 2018-11-16 | 2020-09-22 | 国家计算机网络与信息安全管理中心 | Telecommunication anomaly detection method based on collaborative network representation learning |
CN111432080A (en) * | 2018-12-24 | 2020-07-17 | 北京奇虎科技有限公司 | Ticket data processing method, electronic equipment and computer readable storage medium |
CN109525739A (en) * | 2018-12-25 | 2019-03-26 | 亚信科技(中国)有限公司 | A kind of telephone number recognition methods, device and server |
CN109688275A (en) * | 2018-12-27 | 2019-04-26 | 中国联合网络通信集团有限公司 | Harassing call recognition methods, device and storage medium |
CN110147430A (en) * | 2019-04-25 | 2019-08-20 | 上海欣方智能系统有限公司 | Harassing call recognition methods and system based on random forests algorithm |
CN110177179A (en) * | 2019-05-16 | 2019-08-27 | 国家计算机网络与信息安全管理中心 | A kind of swindle number identification method based on figure insertion |
CN110177179B (en) * | 2019-05-16 | 2020-12-29 | 国家计算机网络与信息安全管理中心 | Fraud number identification method based on graph embedding |
CN110275956A (en) * | 2019-06-24 | 2019-09-24 | 成都数之联科技有限公司 | A kind of personal identification method and system |
CN110505353A (en) * | 2019-08-30 | 2019-11-26 | 北京泰迪熊移动科技有限公司 | A kind of number identification method, equipment and computer storage medium |
CN111126434A (en) * | 2019-11-19 | 2020-05-08 | 山东省科学院激光研究所 | Automatic microseism first arrival time picking method and system based on random forest |
CN111062422B (en) * | 2019-11-29 | 2023-07-14 | 上海观安信息技术股份有限公司 | Method and device for identifying set-way loan system |
CN111062422A (en) * | 2019-11-29 | 2020-04-24 | 上海观安信息技术股份有限公司 | Method and device for systematic identification of road loan |
CN111104521A (en) * | 2019-12-18 | 2020-05-05 | 上海观安信息技术股份有限公司 | Anti-fraud detection method and detection system based on graph analysis |
CN111104521B (en) * | 2019-12-18 | 2023-10-17 | 上海观安信息技术股份有限公司 | Anti-fraud detection method and detection system based on graph analysis |
CN113709747A (en) * | 2020-05-09 | 2021-11-26 | 中国移动通信集团有限公司 | Harassment number identification method and device, computer equipment and storage medium |
CN113709747B (en) * | 2020-05-09 | 2023-10-13 | 中国移动通信集团有限公司 | Harassment number identification method and device, computer equipment and storage medium |
CN111885270A (en) * | 2020-07-09 | 2020-11-03 | 恒安嘉新(北京)科技股份公司 | Abnormal communication detection method, device, equipment and storage medium |
CN111885270B (en) * | 2020-07-09 | 2021-08-24 | 恒安嘉新(北京)科技股份公司 | Abnormal communication detection method, device, equipment and storage medium |
CN113946720A (en) * | 2020-07-17 | 2022-01-18 | 中国移动通信集团广东有限公司 | Method and device for identifying users in group and electronic equipment |
CN111918226A (en) * | 2020-07-23 | 2020-11-10 | 广州市申迪计算机系统有限公司 | Real-time signaling-based method and device for analyzing international high-settlement embezzlement behavior |
CN114449106A (en) * | 2022-02-10 | 2022-05-06 | 恒安嘉新(北京)科技股份公司 | Abnormal telephone number identification method, device, equipment and storage medium |
CN114449106B (en) * | 2022-02-10 | 2024-04-30 | 恒安嘉新(北京)科技股份公司 | Method, device, equipment and storage medium for identifying abnormal telephone number |
CN114979369A (en) * | 2022-04-14 | 2022-08-30 | 马上消费金融股份有限公司 | Abnormal call detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106255116A (en) | A kind of recognition methods harassing number | |
CN109766872B (en) | Image recognition method and device | |
CN109600752A (en) | A kind of method and apparatus of depth cluster swindle detection | |
CN105260628B (en) | Classifier training method and apparatus, auth method and system | |
CN104732978B (en) | The relevant method for distinguishing speek person of text based on combined depth study | |
WO2017143932A1 (en) | Fraudulent transaction detection method based on sample clustering | |
CN104410973B (en) | A kind of fraudulent call recognition methods of playback and system | |
CN107133265A (en) | A kind of method and device of identification behavior abnormal user | |
CN103258535A (en) | Identity recognition method and system based on voiceprint recognition | |
CN110353673A (en) | A kind of brain electric channel selection method based on standard mutual information | |
CN106453971B (en) | The acquisition methods and call center's quality inspection system of call center's quality inspection voice | |
CN109034194A (en) | Transaction swindling behavior depth detection method based on feature differentiation | |
CN106843941B (en) | Information processing method, device and computer equipment | |
CN110248322A (en) | A kind of swindling gang identifying system and recognition methods based on fraud text message | |
CN110084149A (en) | A kind of face verification method based on difficult sample four-tuple dynamic boundary loss function | |
CN106601243A (en) | Video file identification method and device | |
CN106469181A (en) | A kind of user behavior pattern analysis method and device | |
CN113221673B (en) | Speaker authentication method and system based on multi-scale feature aggregation | |
CN108536866B (en) | Microblog hidden key user analysis method based on topic transfer entropy | |
CN109684374A (en) | A kind of extracting method and device of the key-value pair of time series data | |
CN109903053A (en) | A kind of anti-fraud method carrying out Activity recognition based on sensing data | |
CN113961712A (en) | Knowledge graph-based fraud telephone analysis method | |
CN107704631B (en) | Crowdsourcing-based music annotation atom library construction method | |
CN109493882A (en) | A kind of fraudulent call voice automatic marking system and method | |
CN110458094A (en) | Device class method based on fingerprint similarity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161221 |