CN106255116A - A method for identifying harassing numbers - Google Patents

A method for identifying harassing numbers

Info

Publication number
CN106255116A
Authority
CN
China
Prior art keywords
harassing
wrecking
time
random forest
decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610710545.0A
Other languages
Chinese (zh)
Inventor
王瀚辰
王彦青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12 Detection or prevention of fraud
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2281 Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Technology Law (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method for identifying harassing numbers, comprising: selecting a number of confirmed harassing and non-harassing numbers, calculating the communication behavior indicators of these numbers over a period of time, and then forming a training sample set from the numbers and their communication behavior indicators to build a random forest classification model, where the input of the model is the communication behavior indicators of each user number and the output is the prediction probabilities with which all decision trees judge the number to be a harassing number and a non-harassing number; and inputting the communication behavior indicators of a number to be identified over a period of time into the random forest classification model, calculating the prediction probabilities with which all decision trees in the model judge it to be a harassing number and a non-harassing number, and determining accordingly whether the number to be identified is a harassing number. The invention belongs to the field of network communication technology. It makes full use of the call characteristics of calling and called numbers to effectively identify harassing numbers from the massive traffic data of the live network.

Description

A method for identifying harassing numbers
Technical field
The present invention relates to a method for identifying harassing numbers and belongs to the field of network communication technology.
Background art
Harassing calls, which are used to push advertising content and fraudulent information, have become an illegal trade that disturbs social peace. Comprehensive analysis shows that harassing calls generally have the following characteristics:
1. Dispersed called numbers: a harassing number calls many numbers per unit time at a high frequency, and the called numbers are only weakly correlated with one another;
2. The correlation between a harassing number and the numbers it calls is usually weak, i.e. there is little call history between them, and a harassing number usually initiates far more calls as the caller than it receives as the called party;
3. Harassing calls are very short, and the called party is more likely to hang up;
4. Harassing calls generally have a high call frequency and are concentrated in certain time periods.
Patent application CN200910079707.5 (title: Method and device for identifying harassing calls; filing date: 2009-03-06; applicant: ZTE Co., Ltd) discloses a method and device for identifying harassing calls. It introduces identification processing of unfamiliar telephone numbers into the mobile phone: statistics on the call interval, call duration and number of incoming calls of an unfamiliar calling number are automatically compared with the user's judgment rules to identify harassing calls. This technical solution identifies harassing numbers only from statistics on call interval, call duration and number of incoming calls. The judgment method is very simple and does not make full use of the call characteristics of calling and called numbers to effectively identify harassing numbers from the massive traffic data of the live network.
Therefore, how to make full use of the call characteristics of calling and called numbers to effectively identify harassing numbers from the massive traffic data of the live network remains a technical problem worth further study.
Summary of the invention
In view of this, the object of the present invention is to provide a method for identifying harassing numbers that makes full use of the call characteristics of calling and called numbers to effectively identify harassing numbers from the massive traffic data of the live network.
To achieve the above object, the invention provides a method for identifying harassing numbers, comprising:
Step one: select a number of confirmed harassing and non-harassing numbers, calculate the communication behavior indicators of these numbers over a period of time, and then form a training sample set from the numbers and their communication behavior indicators to build a random forest classification model, where the input of the model is the communication behavior indicators of each user number and the output is the prediction probabilities with which all decision trees judge the number to be a harassing number and a non-harassing number;
Step two: input the communication behavior indicators of a number to be identified over a period of time into the random forest classification model, calculate the prediction probabilities with which all decision trees in the model judge it to be a harassing number and a non-harassing number, and determine accordingly whether the number to be identified is a harassing number.
Compared with the prior art, the beneficial effects of the invention are as follows. Communication behavior indicators such as call dispersion, called relationship circle, outgoing-to-incoming call ratio and call time distribution effectively reflect the behavioral characteristics of harassing numbers. The invention uses a random forest classification model that takes multiple communication behavior indicators as input, such as call frequency, called numbers, call duration, ringing duration, number of active releases, number of passive releases, called dispersion, correlation coefficient between the called numbers of the same caller, maximum frequency of calls to the same ten-thousand number segment, calling ratio and standard deviation of call time intervals, and identifies harassing numbers according to the probabilities with which all decision trees judge a number to be harassing or non-harassing. It can therefore use the call characteristics of calling and called numbers, fully mine the data features in a large number of training samples, and effectively identify harassing numbers from the massive traffic data of the live network; the communication behavior indicators can also be adjusted flexibly as needed. Because harassing calls have a high call frequency and are concentrated in certain time periods, the invention further divides a full day's call detail record (CDR) data into communication periods whose durations are multiple time granularities, and calculates the communication behavior indicators of a user number based on the high-frequency communication period under each time granularity, which further improves the near-real-time performance and efficiency of harassing number identification. The invention can also build multiple random forest classification models and select an optimal one according to the recognition rates obtained from testing.
Brief description of the drawings
Fig. 1 is a flow chart of the method for identifying harassing numbers according to the present invention.
Fig. 2 is a detailed flow chart of step A.
Fig. 3 is a detailed flow chart of the generation process of the k-th decision tree, k = 1, 2, ..., K, in step 11.
Fig. 4 is a detailed flow chart of step two in Fig. 1.
Fig. 5 is a detailed flow chart of building a test sample set, testing multiple random forest classification models on it, and selecting an optimal random forest classification model according to the test results.
Detailed description of the invention
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the method for identifying harassing numbers according to the present invention comprises:
Step one: select a number of confirmed harassing and non-harassing numbers, calculate the communication behavior indicators of these numbers over a period of time, and then form a training sample set from the numbers and their communication behavior indicators to build a random forest classification model, where the input of the model is the communication behavior indicators of each user number and the output is the prediction probabilities with which all decision trees judge the number to be a harassing number and a non-harassing number;
Step two: input the communication behavior indicators of a number to be identified over a period of time into the random forest classification model, calculate the prediction probabilities with which all decision trees in the model judge it to be a harassing number and a non-harassing number, and determine accordingly whether the number to be identified is a harassing number.
In step one, confirmed harassing and non-harassing numbers can be selected from blacklist and whitelist tables of historically confirmed harassing numbers (for example, harassing numbers obtained from Internet companies or marked by an operator's reporting and complaint system). Call-event signaling CDR data are then collected by signaling acquisition from equipment such as a signaling monitoring system or the A interface, or historical CDR data are collected from BOSS, to obtain the communication information of the selected numbers over a period of time. Communication records whose key fields are malformed or contain missing values are discarded.
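The record-cleaning rule in step one (discard records whose key fields are malformed or missing) can be sketched as follows. This is a minimal illustration only: the field names `caller`, `callee`, `start_time` and `duration_s`, and the timestamp format, are hypothetical, since the patent does not specify the CDR schema.

```python
from datetime import datetime

# Hypothetical key fields; the patent does not name the actual CDR schema.
KEY_FIELDS = ("caller", "callee", "start_time", "duration_s")


def is_valid_record(rec):
    """Return True only if every key field is present and well formed."""
    for field in KEY_FIELDS:
        if rec.get(field) in (None, ""):
            return False  # missing value in a key field
    if not (str(rec["caller"]).isdigit() and str(rec["callee"]).isdigit()):
        return False  # numbers must consist of digits only (assumed rule)
    try:
        # Assumed timestamp format; a parse failure marks the record malformed.
        datetime.strptime(rec["start_time"], "%Y-%m-%d %H:%M:%S")
        return float(rec["duration_s"]) >= 0
    except (TypeError, ValueError):
        return False


def clean_records(records):
    """Keep only the records that pass the key-field checks."""
    return [r for r in records if is_valid_record(r)]
```

Only the filtering idea comes from the text; the concrete validity checks would depend on the operator's actual record layout.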
Because harassing calls have a high call frequency and are concentrated in certain time periods, and in order to further improve the near-real-time performance and efficiency of harassing number identification, the invention can also calculate the communication behavior indicators of a user number separately for different time granularities. The time granularity can take, but is not limited to, the following values: 1 minute, 5 minutes, 15 minutes, 30 minutes, 60 minutes, 180 minutes, 360 minutes, 720 minutes, 1440 minutes. The communication behavior indicators of a user number over a period of time can thus include the user's communication behavior indicators under each time granularity, the user number and the user flag, where the user flag marks whether the user number is a harassing number (for example, a number selected from the blacklist is marked as a harassing number, and a number selected from the whitelist is marked as a non-harassing number). In the present invention, calculating the communication behavior indicators of a user number (including harassing numbers, non-harassing numbers and numbers to be identified) over a period of time based on time granularity can further comprise:
Step A: collect the user's historical CDR data for multiple consecutive days, select multiple time granularities, find the user's high-frequency communication period under each time granularity, and finally calculate the user's communication behavior indicators under each time granularity from the communication behavior within the high-frequency communication period under that granularity.
As shown in Fig. 2, step A can further comprise:
Step A1: extract the user's CDR data day by day;
Step A2: read the start and end times of the day's CDR data and calculate the maximum time granularity Tmax covered by those times, i.e. from the selected time granularities pick the largest one whose value is less than the duration between the start and end times;
If the collected CDR data are incomplete because of loss or other reasons, for example only the historical CDR data between 12:00 and 24:00 were collected, the invention retains only the time granularities within the maximum time granularity Tmax covered by the start and end times of the CDR data;
Step A3: extract each time granularity in turn and judge whether the extracted time granularity is less than or equal to Tmax. If so, divide the duration covered by the start and end times of the day's CDR data into multiple consecutive communication periods whose length is the extracted time granularity, then calculate the user's call frequency in each communication period, which is the call frequency of the extracted time granularity in each communication period of that day, and continue to extract the next time granularity until all time granularities have been extracted; if not, continue to extract the next time granularity until all time granularities have been extracted;
The extracted time granularity must be less than or equal to Tmax. For example, when Tmax = 30 minutes, the extracted time granularities are 1 minute, 5 minutes, 15 minutes and 30 minutes, and no further calculation is done for time granularities greater than Tmax. When the start and end times of a day's CDR data are 0:00-24:00 and the extracted time granularity T is 30 minutes, the resulting communication periods are 0:00-0:30, 0:31-1:00, ..., 23:01-23:30, 23:31-24:00; each communication period divided by time granularity starts at second 00 of that period and ends at second 59 of the minute before the next communication period;
Step A4: judge whether the CDR data of all days have been extracted. If so, continue to the next step; if not, continue to extract the user's CDR data for the next day and return to step A2;
Step A5: from the call frequencies of all communication periods on all days, select the maximum for each time granularity; the communication period corresponding to that maximum is the user's high-frequency communication period under that time granularity, i.e. the communication period over the multiple days in which the user's call frequency is highest and most concentrated;
Step A6: calculate the user's communication behavior indicators under each time granularity, i.e. the communication behavior indicators within the user's high-frequency communication period under each time granularity. The communication behavior indicators can include, but are not limited to: call frequency, called numbers, call duration, ringing duration, number of active releases, number of passive releases, called dispersion, correlation coefficient between the called numbers of the same caller, maximum frequency of calls to the same ten-thousand number segment, calling ratio (i.e. number of outgoing calls / total number of outgoing and incoming calls), standard deviation of call time intervals (computing this indicator requires 3 or more called numbers), and so on. The correlation coefficient between the called numbers of the same caller is the ratio of the number of called numbers that have call behavior with one another, among all the called numbers the user has called, to the total number of called numbers the user has called. For example, user A called 100 called numbers in the high-frequency communication period of time granularity T, and within a specified time window (such as the training period of days 1 to 5) 4 of those called numbers, B, C, D and E, have call behavior with one another (for instance 5 calls with call duration >= 0 exist: B->C, D->E, C->B, C->D, D->C); the correlation coefficient between the called numbers of the same caller (i.e. user A) is then 4/100. The maximum frequency of calls to the same ten-thousand number segment is the largest number of called numbers, among those the user has called, that belong to the same ten-thousand number segment, where the ten-thousand number segment is what remains of a called number after its last 4 digits are removed. For example, if the ten-thousand number segments of the called numbers a user has called are 1395193, 1395193, 1390123 and 1390438, the numbers of called numbers per segment are 2, 1 and 1, so the maximum frequency of calls to the same ten-thousand number segment is 2.
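The two worked examples in step A6, together with the calling ratio, can be checked with a short sketch. The function names and data shapes here are assumptions made for illustration; only the definitions of the three indicators come from the text.

```python
from collections import Counter


def callee_correlation(callees, calls_between):
    """Share of the caller's called numbers that have call behavior with
    at least one other called number of the same caller."""
    callee_set = set(callees)
    linked = set()
    for a, b in calls_between:  # each pair is (caller, callee) of one call
        if a in callee_set and b in callee_set and a != b:
            linked.update((a, b))
    return len(linked) / len(callee_set) if callee_set else 0.0


def max_same_segment_count(callees):
    """Largest number of called numbers sharing one ten-thousand number
    segment, i.e. the number with its last 4 digits removed."""
    segments = Counter(num[:-4] for num in callees)
    return max(segments.values()) if segments else 0


def caller_ratio(n_outgoing, n_incoming):
    """Outgoing calls divided by total outgoing and incoming calls."""
    total = n_outgoing + n_incoming
    return n_outgoing / total if total else 0.0
```

With 100 called numbers of which B, C, D and E call one another, `callee_correlation` gives 4/100, and four called numbers whose segments are 1395193, 1395193, 1390123 and 1390438 give a maximum same-segment count of 2, matching the examples in the text.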
In step A6, if the number of communication behavior indicators of a user under each time granularity is a and the number of selected time granularities is b, then the number of features (i.e. communication behavior indicators) of each user in the training sample set of the random forest classification model can be M = a*b + 2.
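Steps A1 to A5 above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: call start times are given as minutes since midnight, each day is a hypothetical tuple `(start_min, end_min, call_minutes)`, Tmax is taken as the largest granularity not exceeding the covered span (so that a full day admits the 1440-minute granularity), and ties between periods are resolved in favor of the first one found.

```python
from collections import Counter

# Time granularities in minutes, as listed in the text.
GRANULARITIES_MIN = [1, 5, 15, 30, 60, 180, 360, 720, 1440]


def max_granularity(span_min, granularities=GRANULARITIES_MIN):
    """Step A2: largest chosen granularity not exceeding the covered span."""
    fitting = [g for g in granularities if g <= span_min]
    return max(fitting) if fitting else None


def call_counts(call_minutes, start_min, end_min, granularity):
    """Step A3: count calls in consecutive periods of `granularity` minutes."""
    counts = Counter()
    for t in call_minutes:
        if start_min <= t < end_min:
            counts[(t - start_min) // granularity] += 1
    return counts


def high_frequency_periods(days, granularities=GRANULARITIES_MIN):
    """Steps A1-A5: for each granularity, return (day index, period index,
    call count) of the period with the most calls over all days."""
    best = {}
    for day_idx, (start_min, end_min, call_minutes) in enumerate(days):
        t_max = max_granularity(end_min - start_min, granularities)
        for g in granularities:
            if t_max is None or g > t_max:
                continue  # granularity longer than this day's data
            counts = call_counts(call_minutes, start_min, end_min, g)
            for period, n in counts.items():
                if g not in best or n > best[g][2]:
                    best[g] = (day_idx, period, n)
    return best
```

The indicators of step A6 would then be computed only over the calls falling inside the returned period for each granularity, giving a indicators for each of the b granularities.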
The random forest classification model can handle high-dimensional attribute data, needs no feature selection, trains fast, can detect interactions between attributes during training, can be parallelized, and can output attribute importance, class probabilities and predicted classes. It can therefore be chosen for identifying harassing numbers.
In the present invention, the basic idea of the random forest classification model is as follows. First, K groups of samples are drawn from the original training set (N samples with M attributes) by random sampling with replacement, and each group has the same sample size as the original training set, namely N. Second, m attributes are selected at random, where m is less than or equal to the total number of attributes M. Then a decision tree is generated each time from the N samples with m attributes, so that K decision tree models are built in total and K classification results are obtained. Finally, each record is voted on according to the K classification results to determine its final class. The construction of the random forest classification model therefore has two main parts. One part is the construction of the decision trees, which form the decision tree forest; in the present invention the decision trees can use the Gini impurity method, in which the smaller the impurity, the more important the attribute. The other part is the decision process, which outputs the optimal classification result by voting. Accordingly, in step one, forming a training sample set from the harassing and non-harassing numbers and their communication behavior indicators to build the random forest classification model can further comprise:
Step 11: using the random forest classification model, randomly select m communication behavior indicators from the M communication behavior indicators of each training sample in the training sample set and train on them, producing K feature subsets and thereby generating K decision trees, each of which holds its own prediction probability for the class of an input number. The decision trees can use the Gini impurity method, in which the smaller the impurity, the more important the feature.
As shown in Fig. 3, in step 11 the generation process of the k-th decision tree, k = 1, 2, ..., K, can further comprise:
Step 111: using the bootstrap method, draw N samples with replacement from the training sample set S to form the sample set s(k) of the root node of the k-th decision tree, and set the minimum number of samples covered by a branch to d, where N is the number of samples in the training sample set S and M is the feature dimension (i.e. the number of communication behavior indicators) of each training sample in S;
The sample set s(k) of the root node of every decision tree has the same number of samples as the training sample set S;
Step 112: randomly draw m of the M features as the indicator feature set of the k-th decision tree, where m may be calculated by the formula: [formula omitted in source]
Step 113: starting from the root node of the k-th decision tree and following the principle of minimum Gini impurity, calculate the Gini impurity of each of the m features: [formula omitted in source], where $I_G(j_k)$ is the Gini impurity of the j-th feature of the k-th decision tree, $Z_1$ is the sample set of the left node produced when the sample set s(k) of the root node is split by the optimal binary split under feature j, $Z_2$ is the sample set of the right node produced by the same split, $N(Z_1)$ and $N(Z_2)$ are the sample sizes of $Z_1$ and $Z_2$, and $Gini(Z_1)$ and $Gini(Z_2)$ are the Gini impurities of $Z_1$ and $Z_2$. $Gini(Z_l)$ (l = 1 or 2) may further be calculated as: [formula omitted in source], with i = 0 or 1, where i = 0 denotes a non-harassing number and i = 1 denotes a harassing number, and the formula uses the probability of the harassing or non-harassing numbers identified in the branch corresponding to $Z_l$ under the j-th feature in the k-th decision tree;
Step 114: select the minimum among the Gini impurities of the m features and take the corresponding feature as the root node; split the root node into a left node and a right node; then, with the root node as a constraint, continue to calculate the Gini impurities of the m features at these nodes and select the feature with the minimum value as the node for continued growth, and so on, thereby forming the decision tree. If the number of samples covered by a branch is less than d, the current node of that branch is set as a leaf node, i.e. the node stops growing, and the other nodes continue to be trained until all nodes have been trained or marked as leaf nodes.
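The root-node part of steps 111 to 114 can be sketched as below. Because the patent's formulas are not reproduced in this text, the sketch assumes the standard CART criterion: the Gini impurity of a node is one minus the sum of squared class proportions, and a split is scored by the sample-weighted sum of the impurities of its left and right nodes. The function names, the `value <= threshold` split form and the data layout are illustrative assumptions.

```python
import random


def gini(labels):
    """Gini impurity of 0/1 labels: 1 - p0^2 - p1^2 (standard definition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(rows, labels, feature):
    """Lowest weighted Gini over binary splits `value <= threshold`.
    Returns (score, threshold); (inf, None) if the feature cannot split."""
    n = len(rows)
    best = (float("inf"), None)
    for threshold in sorted({r[feature] for r in rows})[:-1]:
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[0]:
            best = (score, threshold)
    return best


def choose_root(rows, labels, m, rng):
    """Bootstrap N samples, draw m of the M features at random, and return
    (feature index, threshold, score) of the purest split."""
    n, n_features = len(rows), len(rows[0])
    idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
    s_rows = [rows[i] for i in idx]
    s_labels = [labels[i] for i in idx]
    features = rng.sample(range(n_features), m)  # m of the M features
    scored = [(best_split(s_rows, s_labels, j), j) for j in features]
    (score, threshold), j = min(scored, key=lambda t: t[0][0])
    return j, threshold, score
```

Growing a full tree would repeat the same selection on the left and right sample sets and stop a branch once it covers fewer than d samples, as step 114 describes.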
As shown in Fig. 4, step two in Fig. 1 can further comprise:
Step 21: input the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, and calculate the prediction probability with which each leaf node of every decision tree assigns the number to be identified to a class: [formula omitted in source], where i = 0 or 1, i = 0 denoting a non-harassing number and i = 1 a harassing number; the formula relates the prediction probability with which the r-th leaf node of the k-th decision tree assigns the number to be identified to class i, the count of numbers in that leaf node belonging to class i, and the total count of numbers that leaf node contains;
Step 22: calculate the prediction probability with which every decision tree judges the class of the number to be identified: [formula omitted in source], where $P_k(i)$ is the prediction probability with which the k-th decision tree judges that the number to be identified belongs to class i, and $R_k$ is the total number of leaf nodes of the k-th decision tree;
Step 23: calculate the sum of the prediction probabilities with which all decision trees judge the class of the number to be identified: $w(i) = \sum_{k=1}^{K} P_k(i)$, where $w(i)$ is the sum of the prediction probabilities with which all decision trees judge that the number to be identified belongs to class i. Then select the maximum; the class of the number to be identified is the class corresponding to the maximum, i.e. when w(0) > w(1) the class is 0 (a non-harassing number), and when w(0) < w(1) the class is 1 (a harassing number).
The random forest classification model randomly generates K decision trees, and every decision tree contains multiple leaf nodes. For each input user number, some leaf nodes judge the user number to be a harassing number and some judge it to be a non-harassing number. From the prediction probabilities of all leaf nodes for the class of the user number, the probabilities with which every decision tree judges the user number to be a harassing number and a non-harassing number can be obtained, and for every decision tree these two probabilities sum to 1. For example, the 1st decision tree judges the user number to be a harassing number with probability 5/6 and a non-harassing number with probability 1/6, the 2nd decision tree judges it to be a harassing number with probability 2/7 and a non-harassing number with probability 5/7, ..., and the K-th decision tree judges it to be a harassing number with probability 3/5 and a non-harassing number with probability 2/5. The K decision trees then give a probability sum for harassing number (i.e. 5/6 + 2/7 + ... + 3/5) and a probability sum for non-harassing number (i.e. 1/6 + 5/7 + ... + 2/5). If the probability sum for harassing number is greater than the probability sum for non-harassing number, the input number is a harassing number; otherwise it is a non-harassing number.
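The probability-sum decision can be illustrated with exact fractions. The sketch assumes, for the sake of a concrete number, that the forest consists of only the three trees listed in the example (K = 3), and it treats a tie as non-harassing, which the text leaves unspecified.

```python
from fractions import Fraction


def classify(tree_probs):
    """tree_probs holds one (P_k(0), P_k(1)) pair per decision tree.
    Returns (label, w0, w1), where label is 1 for a harassing number.
    A tie w0 == w1 is treated as non-harassing (an assumption)."""
    w0 = sum(p0 for p0, _ in tree_probs)
    w1 = sum(p1 for _, p1 in tree_probs)
    return (1 if w1 > w0 else 0), w0, w1


# The three trees from the example: (non-harassing, harassing) probabilities.
trees = [
    (Fraction(1, 6), Fraction(5, 6)),
    (Fraction(5, 7), Fraction(2, 7)),
    (Fraction(2, 5), Fraction(3, 5)),
]
label, w0, w1 = classify(trees)
```

Under this K = 3 assumption the sums are w(1) = 361/210 and w(0) = 269/210, so the number is classified as a harassing number.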
When a random forest classification model is built, the values of parameters such as the number of decision trees, the feature dimension (i.e. the number of communication behavior indicators) and the depth of the decision trees all affect the recognition performance of the model. To further improve recognition performance, the invention can also build multiple random forest classification models (with different values for the number of decision trees, the feature dimension and the depth of the decision trees), build a test sample set, test each of the random forest classification models on it, and select an optimal random forest classification model according to the test results. As shown in Fig. 5, the invention can also comprise:
Step B1: extract the communication behavior indicators of each test sample in the test sample set one by one, and input all the extracted communication behavior indicators into each random forest classification model, thereby obtaining each model's judgment of whether each test sample is a harassing number;
Step B2: match the harassing numbers identified by each random forest classification model against confirmed harassing numbers (for example, harassing numbers marked by Internet companies or by an operator's reporting and complaint system), and calculate the precision and recall of each random forest classification model;
Step B3: from the precision and recall, calculate the recognition rate F of each random forest classification model: [formula omitted in source], where Precision is the precision and Recall is the recall; then select the maximum F among the recognition rates of all random forest classification models. The random forest classification model corresponding to the maximum is the optimal random forest classification model.
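Steps B2 and B3 can be sketched as follows. Since the formula for F is not reproduced in this text, the sketch assumes the common F1 score, i.e. the harmonic mean of precision and recall; the patent's actual formula may differ. The model names and set-based inputs are hypothetical.

```python
def precision_recall(predicted, confirmed):
    """predicted: numbers a model flags as harassing.
    confirmed: numbers known to be harassing. Both are sets."""
    hits = len(predicted & confirmed)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(confirmed) if confirmed else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1), assumed form of F."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pick_best(model_predictions, confirmed):
    """Return the name of the model whose predictions give the highest F."""
    return max(
        model_predictions,
        key=lambda name: f_score(
            *precision_recall(model_predictions[name], confirmed)
        ),
    )
```

For instance, a model with precision 1.0 and recall 0.5 scores about 0.67 under this assumption, while one with precision and recall both 0.75 scores 0.75 and would be selected.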
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. the recognition methods harassing number, it is characterised in that include:
Step one, choose some confirmed harassing and wrecking and non-harassing and wrecking number, calculate described harassing and wrecking and non-harassing and wrecking number when one section In communication behavior index, then described harassing and wrecking and non-harassing and wrecking number and communication behavior index thereof are formed training sample set Building random forest disaggregated model, the input of described random forest disaggregated model is the communication behavior index of each Subscriber Number, Output is that all decision trees determine that it is harassing and wrecking number and the prediction probability of non-harassing and wrecking number;
Step 2, by the number to be identified communication behavior index input random forest disaggregated model within a period of time, and calculate In random forest disaggregated model, all decision trees determine that it is harassing and wrecking number and the prediction probability of non-harassing and wrecking number, to sentence accordingly Whether fixed described number to be identified is harassing and wrecking numbers.
Method the most according to claim 1, it is characterised in that calculate Subscriber Number communication behavior within a period of time and refer to Mark, has farther included:
Step A, collection user's history call bill data of continuous many days, choose multiple time granularity, then look for user each High frequency talk period under time granularity, finally according to the communication row in user's high frequency talk period under each time granularity User's communication behavior index under each time granularity, Subscriber Number communication behavior within a period of time is calculated for index Index includes but not limited to: user's communication behavior index, Subscriber Number and ID under each time granularity.
3. The method according to claim 2, characterized in that the time granularity takes values including, but not limited to: 1 minute, 5 minutes, 15 minutes, 30 minutes, 60 minutes, 180 minutes, 360 minutes, 720 minutes, and 1440 minutes.
4. The method according to claim 2, characterized in that step A further comprises:
Step A1: extracting the user's call detail records one day at a time;
Step A2: reading the start and end times of that day's call detail records, and calculating the maximum time granularity Tmax covered by the start and end times, i.e. picking, from the selected time granularities, the largest one whose value is less than the duration spanned by the start and end times;
Step A3: extracting the time granularities one by one and judging whether the extracted time granularity is less than or equal to Tmax; if so, dividing the duration spanned by the start and end times of that day's call detail records into multiple consecutive communication periods whose length is the extracted time granularity, then calculating the user's call frequency in each communication period, which is the call frequency of the extracted time granularity in each communication period of that day, and continuing to extract the next time granularity until all time granularities have been extracted; if not, continuing to extract the next time granularity until all time granularities have been extracted;
Step A4: judging whether the call detail records of all days have been extracted; if so, proceeding to the next step; if not, extracting the user's call detail records for the next day and returning to step A2;
Step A5: for each time granularity, selecting the maximum from the call frequencies of all communication periods over all days, the communication period corresponding to the maximum being the user's high-frequency communication period at that time granularity;
Step A6: calculating the user's communication behavior indicators at each time granularity, i.e. the communication behavior indicators within the user's high-frequency call period at each time granularity.
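Steps A1 to A5 can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: it assumes call start times are given as minutes since midnight, takes the "start and end times" of a day to be its first and last call, and keeps the first window found when several windows tie. The function name `high_frequency_periods` and the data layout are hypothetical.

```python
from collections import Counter

# Time granularities in minutes, as listed in claim 3.
GRANULARITIES = [1, 5, 15, 30, 60, 180, 360, 720, 1440]

def high_frequency_periods(calls_by_day, granularities=GRANULARITIES):
    """calls_by_day: {day_label: [call start times, minutes since midnight]}.
    Returns {granularity: (day_label, window_start, window_end, call_count)}."""
    best = {}
    for day, times in calls_by_day.items():      # A1 / A4: one day at a time
        if not times:
            continue
        start, end = min(times), max(times)      # A2: assumed first/last call
        span = end - start
        covered = [g for g in granularities if g < span]
        if not covered:
            continue
        t_max = max(covered)                     # A2: Tmax for this day
        for g in granularities:                  # A3: each granularity <= Tmax
            if g > t_max:
                continue
            # Count calls per consecutive window of length g.
            counts = Counter((t - start) // g for t in times)
            for idx, n in counts.items():
                window = (day, start + idx * g, start + (idx + 1) * g, n)
                if g not in best or n > best[g][3]:   # A5: keep the busiest
                    best[g] = window
    return best
```

Step A6 would then compute the behavior indicators only over the calls that fall inside each returned window.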
5. The method according to claim 1, characterized in that the communication behavior indicators include, but are not limited to: calling frequency, number of times called, call duration, ringing duration, number of active releases, number of passive releases, called-number dispersion, correlation coefficient between the called numbers of the same caller, maximum frequency of calling the same ten-thousand number segment, caller ratio, and standard deviation of call intervals, wherein:
the correlation coefficient between the called numbers of the same caller is the ratio of the number of called numbers, among all those the user has called, that have call behavior with one another to the total number of called numbers the user has called; the maximum frequency of calling the same ten-thousand number segment is the largest count of called numbers the user has called that belong to the same ten-thousand number segment, where the ten-thousand number segment is the part of a called number that remains after its last 4 digits are removed.
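The two indicators defined in this claim can be illustrated with a short sketch. It assumes numbers are digit strings, counts distinct called numbers per segment (the claim does not say whether repeats count), and treats "call behavior with one another" as any observed call between two numbers in the user's called set. All function names are hypothetical.

```python
from collections import Counter

def ten_thousand_segment(number):
    """Prefix left after dropping the last 4 digits of a called number."""
    return number[:-4]

def max_same_segment_frequency(called_numbers):
    """Largest count of distinct called numbers sharing one segment."""
    counts = Counter(ten_thousand_segment(n) for n in set(called_numbers))
    return max(counts.values()) if counts else 0

def called_correlation(called_numbers, call_pairs):
    """Share of the user's called numbers that have a call with another
    number in the same called set.
    call_pairs: iterable of (caller, callee) tuples seen in the records."""
    called = set(called_numbers)
    if not called:
        return 0.0
    linked = set()
    for a, b in call_pairs:
        if a in called and b in called and a != b:
            linked.update((a, b))
    return len(linked) / len(called)
```

A high segment frequency combined with a low correlation suggests sequential dialing of strangers, which is the pattern these indicators appear designed to expose.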
6. The method according to claim 1, characterized in that, in step 1, forming a training sample set from the harassing and non-harassing numbers and their communication behavior indicators to build a random forest classification model further comprises:
Step 11: using the random forest classification model, randomly selecting m communication behavior indicators from the M communication behavior indicators of each training sample in the training sample set for training, so as to produce K feature subsets and thereby generate K decision trees, each decision tree carrying its own prediction probability for the class of the input number, wherein the decision trees are built using the Gini impurity method.
7. The method according to claim 6, characterized in that, in step 11, the generation of the k-th decision tree further comprises:
Step 111: using the bootstrap method to draw N samples with replacement from the training sample set S to form the sample set s(k) of the root node of the k-th decision tree, and setting the minimum number of samples covered by a branch to d, where N is the number of samples in the training sample set S and each training sample in S has feature dimension M;
Step 112: randomly drawing m of the M feature dimensions as the indicator feature set of the k-th decision tree;
Step 113: starting from the root node of the k-th decision tree and following the minimum-Gini-impurity principle, calculating the Gini impurity of each of the m features: [formula not reproduced], where I_G(j_k) is the Gini impurity of the j-th feature of the k-th decision tree, Z_1 is the sample set of the left node produced by the optimal binary split of the root-node sample set s(k) on feature j, Z_2 is the sample set of the right node produced by that split, N(Z_1) and N(Z_2) are the sample sizes of Z_1 and Z_2 respectively, and Gini(Z_1) and Gini(Z_2) are the Gini impurities of Z_1 and Z_2 respectively;
Step 114: selecting the minimum among the Gini impurities of the m features, taking the feature corresponding to the minimum as the root node, and splitting the root node into a left node and a right node; then, with the root node as the constraint, continuing to calculate the Gini impurity of the m features under that root node and selecting the feature corresponding to the minimum as the root node for continued growth, and so on, thereby forming the decision tree; wherein, if the number of samples covered by a branch is less than d, the current node of that branch is set as a leaf node, i.e. that node stops growing, and training continues on the other nodes until all nodes have been trained or marked as leaf nodes.
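The formula for I_G(j_k) in step 113 is not reproduced in this text. A conventional form that uses only the symbols defined in that step is the size-weighted average of the two child impurities, shown below. It is an assumption based on the standard CART split criterion, not a formula confirmed by the source.

```latex
I_G(j_k) = \frac{N(Z_1)}{N(Z_1) + N(Z_2)}\,\mathrm{Gini}(Z_1)
         + \frac{N(Z_2)}{N(Z_1) + N(Z_2)}\,\mathrm{Gini}(Z_2)
```

Under this form, a split that separates the two classes perfectly scores 0, which is consistent with choosing the feature of minimum impurity in step 114.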
8. The method according to claim 7, characterized in that the formula for m is: [formula not reproduced]; and the formula for Gini(Z_l) is: [formula not reproduced], where l = 1 or 2 and i = 0 or 1, i = 0 denoting a non-harassing number and i = 1 denoting a harassing number, and the probability term in the formula is the probability of the identified harassing or non-harassing numbers in the branch corresponding to Z_l under the j-th feature in the k-th decision tree.
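Because the formulas of this claim are missing from the text, the sketch below falls back on the textbook binary Gini impurity, 1 minus the sum of squared class probabilities, and on a size-weighted split score. Both are assumptions, as are the function names and the midpoint threshold search; the formula for m is not guessed at here.

```python
def gini(labels):
    """Textbook Gini impurity of binary labels
    (0 = non-harassing, 1 = harassing)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    p0 = 1.0 - p1
    return 1.0 - (p0 * p0 + p1 * p1)

def weighted_gini(left, right):
    """Size-weighted impurity of a binary split (assumed form of I_G)."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def best_threshold(values, labels):
    """Try each midpoint between sorted distinct feature values and
    return (threshold, impurity) for the lowest-impurity split."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (thr, score)
    return best
```

Running `best_threshold` once per candidate feature and keeping the feature with the smallest impurity corresponds to the selection described in step 114.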
9. The method according to claim 1, characterized in that step 2 further comprises:
Step 21: inputting the communication behavior indicators of the number to be identified over a period of time into the random forest classification model, and calculating, for each leaf node of every decision tree, the prediction probability of the class to which the number to be identified belongs: [formula not reproduced], where i = 0 or 1, i = 0 denoting a non-harassing number and i = 1 denoting a harassing number; the terms of the formula are the prediction probability with which the r-th leaf node of the k-th decision tree assigns the number to be identified to class i, the count of numbers in the r-th leaf node of the k-th decision tree that belong to class i, and the total count of numbers contained in the r-th leaf node of the k-th decision tree;
Step 22: calculating, for every decision tree, the prediction probability of the class to which the number to be identified belongs: [formula not reproduced], where P_k(i) is the prediction probability with which the k-th decision tree judges that the number to be identified belongs to class i, and R_k is the total number of leaf nodes of the k-th decision tree;
Step 23: calculating the sum of the prediction probabilities with which all decision trees judge the class of the number to be identified: [formula not reproduced], where w(i) is the sum of the prediction probabilities with which all decision trees judge that the number to be identified belongs to class i; then selecting the maximum among them, the class of the number to be identified being the class corresponding to the maximum.
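Steps 21 and 23 can be sketched as below, with two stated assumptions: the leaf probability is taken as the class count divided by the leaf total, as the term definitions in step 21 suggest, and w(i) is taken as the plain sum of P_k(i) over the K trees. How step 22 combines leaf probabilities into P_k(i) is not recoverable from this text, so the sketch accepts P_k(i) as an input. Function names are hypothetical.

```python
def leaf_probability(class_counts):
    """class_counts: [count of class 0, count of class 1] in one leaf.
    Returns the per-class share of numbers in that leaf (step 21)."""
    total = sum(class_counts)
    return [c / total for c in class_counts]

def forest_vote(tree_probs):
    """tree_probs: one [P_k(0), P_k(1)] pair per decision tree.
    Returns (w, predicted_class), where w[i] sums P_k(i) over all
    trees (step 23); a tie resolves to class 0 in this sketch."""
    w = [sum(p[i] for p in tree_probs) for i in (0, 1)]
    predicted = max((0, 1), key=lambda i: w[i])
    return w, predicted
```

With three trees reporting `[0.25, 0.75]`, `[0.5, 0.5]` and `[0.25, 0.75]`, the sums are 1.0 and 2.0, so the number would be labeled class 1 (harassing).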
10. The method according to claim 1, characterized by further comprising:
Step B1: extracting the communication behavior indicators of each test sample from the test sample set one by one, and inputting all the extracted communication behavior indicators into each random forest classification model, thereby obtaining each random forest classification model's determination of whether each test sample is a harassing number;
Step B2: matching the harassing numbers identified by each random forest classification model against the confirmed harassing numbers, and calculating the precision and recall of each random forest classification model;
Step B3: calculating the recognition rate F of each random forest classification model from its precision and recall: [formula not reproduced], where Precision is the precision and Recall is the recall; and selecting the maximum F among the recognition rates of all random forest classification models, the random forest classification model corresponding to the maximum being the optimal random forest classification model.
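Steps B2 and B3 can be illustrated as follows. The formula for F is not reproduced in this text; the sketch assumes the balanced F-measure, the harmonic mean of precision and recall, which is a common choice but is not confirmed by the source. The function names and the dictionary layout are hypothetical.

```python
def precision_recall(predicted, confirmed):
    """predicted: set of numbers a model flags as harassing.
    confirmed: set of numbers known to be harassing (step B2)."""
    hits = len(predicted & confirmed)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(confirmed) if confirmed else 0.0
    return precision, recall

def f_measure(precision, recall):
    """Assumed balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pick_best_model(predictions_by_model, confirmed):
    """Score every model and return (best_model_name, all_scores)."""
    scores = {name: f_measure(*precision_recall(pred, confirmed))
              for name, pred in predictions_by_model.items()}
    return max(scores, key=scores.get), scores
```

Scoring on a single combined value avoids picking a model that reaches high precision only by flagging very few numbers.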
CN201610710545.0A 2016-08-24 2016-08-24 A kind of recognition methods harassing number Pending CN106255116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610710545.0A CN106255116A (en) 2016-08-24 2016-08-24 A kind of recognition methods harassing number

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610710545.0A CN106255116A (en) 2016-08-24 2016-08-24 A kind of recognition methods harassing number

Publications (1)

Publication Number Publication Date
CN106255116A true CN106255116A (en) 2016-12-21

Family

ID=57594647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610710545.0A Pending CN106255116A (en) 2016-08-24 2016-08-24 A kind of recognition methods harassing number

Country Status (1)

Country Link
CN (1) CN106255116A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1805428A (en) * 2006-01-24 2006-07-19 陈永霞 Number sorted communication network technique
CN104023109A (en) * 2014-06-27 2014-09-03 深圳市中兴移动通信有限公司 Incoming call prompt method and device as well as incoming call classifying method and device
WO2015062209A1 (en) * 2013-10-29 2015-05-07 华为技术有限公司 Visualized optimization processing method and device for random forest classification model
CN104717674A (en) * 2014-12-02 2015-06-17 北京奇虎科技有限公司 Number attribute recognition method and device, terminal and server
CN105718490A (en) * 2014-12-04 2016-06-29 阿里巴巴集团控股有限公司 Method and device for updating classifying model

Cited By (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256542A (en) * 2016-12-29 2018-07-06 北京搜狗科技发展有限公司 A kind of feature of communication identifier determines method, apparatus and equipment
CN106779868A (en) * 2016-12-30 2017-05-31 中国民航信息网络股份有限公司 Big customer's labeling method and device
CN107506776A (en) * 2017-01-16 2017-12-22 恒安嘉新(北京)科技股份公司 A kind of analysis method of fraudulent call number
CN107133265A (en) * 2017-03-31 2017-09-05 咪咕动漫有限公司 A kind of method and device of identification behavior abnormal user
CN107133265B (en) * 2017-03-31 2021-07-09 咪咕动漫有限公司 Method and device for identifying user with abnormal behavior
CN106982284A (en) * 2017-04-12 2017-07-25 北京奇虎科技有限公司 The recognition methods of harassing call number and device
CN108810230A (en) * 2017-04-26 2018-11-13 腾讯科技(深圳)有限公司 A kind of method, apparatus and equipment obtaining incoming call prompting information
CN107273531A (en) * 2017-06-28 2017-10-20 百度在线网络技术(北京)有限公司 Telephone number classifying identification method, device, equipment and storage medium
CN107273531B (en) * 2017-06-28 2021-01-08 百度在线网络技术(北京)有限公司 Telephone number classification identification method, device, equipment and storage medium
CN109429230B (en) * 2017-08-28 2022-01-25 中国移动通信集团浙江有限公司 Communication fraud identification method and system
CN109429230A (en) * 2017-08-28 2019-03-05 中国移动通信集团浙江有限公司 A kind of communication swindle recognition methods and system
CN109547393A (en) * 2017-09-21 2019-03-29 腾讯科技(深圳)有限公司 Malice number identification method, device, equipment and storage medium
CN107733900A (en) * 2017-10-23 2018-02-23 中国人民解放军信息工程大学 One kind communication network users abnormal call behavioral value method for early warning
CN107733900B (en) * 2017-10-23 2019-10-29 中国人民解放军信息工程大学 A kind of communication network users abnormal call behavioral value method for early warning
CN109995924A (en) * 2017-12-30 2019-07-09 中国移动通信集团贵州有限公司 Cheat phone recognition methods, device, equipment and medium
CN108198086B (en) * 2018-01-31 2021-06-25 海南海航信息技术有限公司 Method and device for identifying disturbance source according to communication behavior characteristics
CN108198086A (en) * 2018-01-31 2018-06-22 海南海航信息技术有限公司 For identifying the method and apparatus in harassing and wrecking source according to communication behavior feature
CN110351731A (en) * 2018-04-08 2019-10-18 中兴通讯股份有限公司 A kind of method and device of phone number antifraud
CN110414543A (en) * 2018-04-28 2019-11-05 中国移动通信集团有限公司 A kind of method of discrimination, equipment and the computer storage medium of telephone number danger level
CN109241418B (en) * 2018-08-22 2024-04-09 中国平安人寿保险股份有限公司 Abnormal user identification method and device based on random forest, equipment and medium
CN109241418A (en) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Abnormal user recognition methods and device, equipment, medium based on random forest
CN108989581B (en) * 2018-09-21 2022-03-22 中国银行股份有限公司 User risk identification method, device and system
CN108989581A (en) * 2018-09-21 2018-12-11 中国银行股份有限公司 A kind of consumer's risk recognition methods, apparatus and system
CN109587357A (en) * 2018-11-14 2019-04-05 上海麦图信息科技有限公司 A kind of recognition methods of harassing call
CN109587357B (en) * 2018-11-14 2021-04-06 上海麦图信息科技有限公司 Crank call identification method
CN109474756A (en) * 2018-11-16 2019-03-15 国家计算机网络与信息安全管理中心 A kind of telecommunications method for detecting abnormality indicating study based on contract network
CN109474756B (en) * 2018-11-16 2020-09-22 国家计算机网络与信息安全管理中心 Telecommunication anomaly detection method based on collaborative network representation learning
CN111432080A (en) * 2018-12-24 2020-07-17 北京奇虎科技有限公司 Ticket data processing method, electronic equipment and computer readable storage medium
CN109525739A (en) * 2018-12-25 2019-03-26 亚信科技(中国)有限公司 A kind of telephone number recognition methods, device and server
CN109688275A (en) * 2018-12-27 2019-04-26 中国联合网络通信集团有限公司 Harassing call recognition methods, device and storage medium
CN110147430A (en) * 2019-04-25 2019-08-20 上海欣方智能系统有限公司 Harassing call recognition methods and system based on random forests algorithm
CN110177179A (en) * 2019-05-16 2019-08-27 国家计算机网络与信息安全管理中心 A kind of swindle number identification method based on figure insertion
CN110177179B (en) * 2019-05-16 2020-12-29 国家计算机网络与信息安全管理中心 Fraud number identification method based on graph embedding
CN110275956A (en) * 2019-06-24 2019-09-24 成都数之联科技有限公司 A kind of personal identification method and system
CN110505353A (en) * 2019-08-30 2019-11-26 北京泰迪熊移动科技有限公司 A kind of number identification method, equipment and computer storage medium
CN111126434A (en) * 2019-11-19 2020-05-08 山东省科学院激光研究所 Automatic microseism first arrival time picking method and system based on random forest
CN111062422B (en) * 2019-11-29 2023-07-14 上海观安信息技术股份有限公司 Method and device for identifying set-way loan system
CN111062422A (en) * 2019-11-29 2020-04-24 上海观安信息技术股份有限公司 Method and device for systematic identification of road loan
CN111104521A (en) * 2019-12-18 2020-05-05 上海观安信息技术股份有限公司 Anti-fraud detection method and detection system based on graph analysis
CN111104521B (en) * 2019-12-18 2023-10-17 上海观安信息技术股份有限公司 Anti-fraud detection method and detection system based on graph analysis
CN113709747A (en) * 2020-05-09 2021-11-26 中国移动通信集团有限公司 Harassment number identification method and device, computer equipment and storage medium
CN113709747B (en) * 2020-05-09 2023-10-13 中国移动通信集团有限公司 Harassment number identification method and device, computer equipment and storage medium
CN111885270A (en) * 2020-07-09 2020-11-03 恒安嘉新(北京)科技股份公司 Abnormal communication detection method, device, equipment and storage medium
CN111885270B (en) * 2020-07-09 2021-08-24 恒安嘉新(北京)科技股份公司 Abnormal communication detection method, device, equipment and storage medium
CN113946720A (en) * 2020-07-17 2022-01-18 中国移动通信集团广东有限公司 Method and device for identifying users in group and electronic equipment
CN111918226A (en) * 2020-07-23 2020-11-10 广州市申迪计算机系统有限公司 Real-time signaling-based method and device for analyzing international high-settlement embezzlement behavior
CN114449106A (en) * 2022-02-10 2022-05-06 恒安嘉新(北京)科技股份公司 Abnormal telephone number identification method, device, equipment and storage medium
CN114449106B (en) * 2022-02-10 2024-04-30 恒安嘉新(北京)科技股份公司 Method, device, equipment and storage medium for identifying abnormal telephone number
CN114979369A (en) * 2022-04-14 2022-08-30 马上消费金融股份有限公司 Abnormal call detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106255116A (en) A kind of recognition methods harassing number
CN109766872B (en) Image recognition method and device
CN109600752A (en) A kind of method and apparatus of depth cluster swindle detection
CN105260628B (en) Classifier training method and apparatus, auth method and system
CN104732978B (en) The relevant method for distinguishing speek person of text based on combined depth study
WO2017143932A1 (en) Fraudulent transaction detection method based on sample clustering
CN104410973B (en) A kind of fraudulent call recognition methods of playback and system
CN107133265A (en) A kind of method and device of identification behavior abnormal user
CN103258535A (en) Identity recognition method and system based on voiceprint recognition
CN110353673A (en) A kind of brain electric channel selection method based on standard mutual information
CN106453971B (en) The acquisition methods and call center's quality inspection system of call center's quality inspection voice
CN109034194A (en) Transaction swindling behavior depth detection method based on feature differentiation
CN106843941B (en) Information processing method, device and computer equipment
CN110248322A (en) A kind of swindling gang identifying system and recognition methods based on fraud text message
CN110084149A (en) A kind of face verification method based on difficult sample four-tuple dynamic boundary loss function
CN106601243A (en) Video file identification method and device
CN106469181A (en) A kind of user behavior pattern analysis method and device
CN113221673B (en) Speaker authentication method and system based on multi-scale feature aggregation
CN108536866B (en) Microblog hidden key user analysis method based on topic transfer entropy
CN109684374A (en) A kind of extracting method and device of the key-value pair of time series data
CN109903053A (en) A kind of anti-fraud method carrying out Activity recognition based on sensing data
CN113961712A (en) Knowledge graph-based fraud telephone analysis method
CN107704631B (en) Crowdsourcing-based music annotation atom library construction method
CN109493882A (en) A kind of fraudulent call voice automatic marking system and method
CN110458094A (en) Device class method based on fingerprint similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161221

RJ01 Rejection of invention patent application after publication