CN104469025A

CN104469025A - Clustering-algorithm-based method and system for intercepting fraud phone in real time

Info

Publication number: CN104469025A
Application number: CN201410693578.XA
Authority: CN
Inventors: 廖建新; 王彦青; 林大庆; 林建洪; 张锦然; 单瑞超; 马宪
Original assignee: Hangzhou Dongxin Beiyou Information Technology Co Ltd
Current assignee: Xinxun Digital Technology (Hangzhou) Co.,Ltd.
Priority date: 2014-11-26
Filing date: 2014-11-26
Publication date: 2015-03-25
Anticipated expiration: 2034-11-26
Also published as: CN104469025B

Abstract

The invention provides a clustering-algorithm-based method and system for intercepting fraud calls in real time. In the method, several characteristic index values are calculated for every calling number over a certain time period, and a clustering algorithm then divides all calling numbers into three clusters so that the calling numbers within each cluster have the same or similar characteristic index values. The characteristic index values of confirmed fraud numbers are matched against those of the calling numbers in each of the three clusters: the closer the value intervals formed by the characteristic index values, the higher the matching similarity. The cluster with the highest matching similarity is set as the fraud number cluster, and the cluster with the second-highest matching similarity is set as the suspected fraud number cluster. All calling numbers in the fraud number cluster and the suspected fraud number cluster are then written to a forensics number table and an intercepted number table, respectively. The method and system belong to the technical field of network communication and allow fraud numbers to be identified automatically and precisely, and intercepted in real time, across the whole network.

Description

Method and system for real-time interception of fraud calls based on a clustering algorithm

Technical field

The present invention relates to a method and system for real-time interception of fraud calls based on a clustering algorithm, and belongs to the technical field of network communication.

Background

With the spread of mobile phones, telephone fraud keeps emerging. Although government departments issue warnings to the public and the news media report on it repeatedly, large numbers of users are still deceived every day, and the economic losses rise year by year.

At present, fraud calls are mainly handled by blacklist interception, that is, confirmed fraud numbers are written into a blacklist. For example, patent application CN 201310004829.4 (title: a spam call interception system based on call pattern recognition and its working method; applicant: Shanghai Xin Fang intelligent system Co., Ltd; filing date: 2013-01-07) proposes a system based on the behavioural habits of telephone users on hearing a voice message, combined with speech recognition technology. The system requires subscribers at suspected risk to be configured on the gateway exchange or tandem exchange of the existing communication network; according to the incoming-call attributes subscribed by the user, the signaling message stream and the media stream of a suspected spam call are sent separately into the system for call interception analysis. The following devices must also be deployed: a call pattern recognition and call interception server, a service database, an audio analysis server, a signaling gateway and a media gateway. Because fraudsters keep changing their methods, fraud numbers are increasingly well hidden and increasingly varied in form. Although more and more fraud numbers are found and confirmed, the confirmed numbers are only a very small part of the fraud calls present in the whole network. That technical scheme does not address automatic, precise identification and real-time interception of fraud numbers across the whole network.

Therefore, how to identify fraud numbers automatically and precisely, and intercept them in real time, across the whole network is a technical problem worth further study.

Summary of the invention

In view of this, the object of the present invention is to provide a method and system for real-time interception of fraud calls based on a clustering algorithm, which can identify fraud numbers automatically and precisely and intercept them in real time across the whole network.

To achieve this object, the invention provides a method for real-time interception of fraud calls based on a clustering algorithm, comprising:

Step 1: from the collected call detail records, calculate several characteristic index values of every calling number within a certain time period, then use a clustering algorithm to divide all calling numbers into three clusters, so that the calling numbers in each cluster have the same or similar characteristic index values;

Step 2: match the characteristic index values of confirmed fraud numbers against the characteristic index values of the calling numbers in each of the three clusters; the closer the value intervals formed by the characteristic index values, the higher the matching similarity; finally, set the cluster with the highest matching similarity as the fraud number cluster and the cluster with the second-highest matching similarity as the suspected fraud number cluster;

Step 3: write all calling numbers in the fraud number cluster and the suspected fraud number cluster to the forensics number table and the intercepted number table, respectively.

To achieve this object, the invention also provides a system for real-time interception of fraud calls based on a clustering algorithm, comprising an anti-fraud platform, wherein the anti-fraud platform further comprises:

A cluster analysis device, configured to calculate, from the collected call detail records, several characteristic index values of every calling number within a certain time period; to divide all calling numbers into three clusters with a clustering algorithm, so that the calling numbers in each cluster have the same or similar characteristic index values; to match the characteristic index values of confirmed fraud numbers against those of the calling numbers in each of the three clusters, the matching similarity being higher the closer the value intervals formed by the characteristic index values; and finally to set the cluster with the highest matching similarity as the fraud number cluster and the cluster with the second-highest matching similarity as the suspected fraud number cluster;

A number table updating device, configured to write all calling numbers in the fraud number cluster and the suspected fraud number cluster to the forensics number table and the intercepted number table, respectively.

Compared with the prior art, the beneficial effects of the invention are as follows. The invention classifies calling numbers by their characteristics using a clustering algorithm, placing calling numbers with the same or similar characteristics into the fraud number cluster and the suspected fraud number cluster, and then uses a logistic regression algorithm to select the fraud numbers and suspected fraud numbers within each cluster, so that fraud numbers can be identified automatically and precisely and intercepted in real time across the whole network. For fraud numbers, the invention further records the call as forensic evidence and saves the recording file to a sample library, so that the sample library grows steadily richer and fraud calls are identified with increasing accuracy. For suspected fraud numbers, the invention further compares the recording file automatically against the fraud samples in the sample library; for playback fraud calls in particular, a two-dimensional characteristic value analysis of the voice over time and energy distinguishes different voices effectively, and when the recording file and a fraud sample are identified as the same voice, the call in progress is intercepted and interrupted in real time.

Brief description of the drawings

Fig. 1 is a flowchart of the method for real-time interception of fraud calls based on a clustering algorithm according to the present invention.

Fig. 2 is a detailed flowchart of step 1 in Fig. 1.

Fig. 3 is a detailed flowchart of the recording forensics and real-time interception performed when a user initiates a call.

Fig. 4 is a detailed flowchart of comparing the recording file one by one with the fraud samples in the playback-voice sample library.

Fig. 5 is a structural diagram of the system for real-time interception of fraud calls based on a clustering algorithm according to the present invention.

Fig. 6 is a structural diagram of the cluster analysis device.

Fig. 7 is a structural diagram of the fraud interception device.

Fig. 8 is a structural diagram of the playback-voice recognition unit.

Detailed description

To make the object, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Research shows that fraud calls, suspected fraud calls and non-fraud calls generally differ markedly in their characteristics. Fraud calls show high-frequency calling during busy hours, relatively concentrated called users and high dispersion of call intervals. Suspected fraud calls show high-frequency calling, relatively dispersed called users, high calling-circle overlap and high dispersion of call times. Non-fraud calls show low-frequency calling concentrated in time, low calling-circle overlap, few calls and essentially no calling during busy hours. The invention can therefore use a clustering algorithm to classify calling numbers according to several characteristic index values of all calling numbers in the call detail records, assigning calling numbers with the same or similar characteristics to one cluster; that is, all users are divided into several clusters with clearly different characteristics. By comparing these clusters with the characteristics of confirmed fraud calls, the fraud number cluster closest to the confirmed fraud calls and the next-closest suspected fraud number cluster are found. Within the fraud number cluster and the suspected fraud number cluster, the invention then uses a logistic regression algorithm to identify the fraud calls and suspected fraud calls more precisely, achieving precise identification and interception of fraud calls across the whole network.

As shown in Fig. 1, the method for real-time interception of fraud calls based on a clustering algorithm according to the present invention comprises:

Because fraud calls and suspected fraud calls share the same or similar characteristics, several characteristic indexes with obvious differences can be chosen. Through repeated trial and verification, the following characteristic indexes were found to distinguish fraud calls from non-fraud calls effectively: call frequency, number of called numbers, standard deviation of call intervals, number of calls to frequently called numbers, peak calling period, the maximum number of calls to the same called number, the second-largest number of calls to the same called number, and the third-largest number of calls to the same called number. It is judged whether these characteristic index values fall within the same or a similar interval as the characteristic index values of confirmed fraud calls; the closer the values, the higher the matching similarity. In addition, the calling numbers in the three clusters can be compared with the confirmed fraud numbers to count how many confirmed fraud numbers each cluster contains. Finally, weighing the matching similarity of the characteristic index values, the number of confirmed fraud numbers and other factors together, one fraud number cluster and one suspected fraud number cluster are selected from the three clusters;
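By way of illustration only, the cluster selection described above might be sketched as follows. The text does not give a formula for the matching similarity, so this sketch assumes an interval overlap ratio (intersection over union of the per-index value ranges) and adds a weighted share of confirmed fraud numbers found in each cluster; the names `rank_clusters`, `overlap` and `count_weight` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]

def intervals(rows: Sequence[Sequence[float]]) -> List[Interval]:
    """Per-index (min, max) value interval over a set of characteristic index vectors."""
    return [(min(col), max(col)) for col in zip(*rows)]

def overlap(a: Interval, b: Interval) -> float:
    """Assumed similarity of two intervals: intersection length over union length."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    union = max(a[1], b[1]) - min(a[0], b[0])
    if union == 0:  # both intervals are the same single point
        return 1.0
    return max(inter, 0.0) / union

def rank_clusters(clusters: Dict[int, Dict[str, List[float]]],
                  confirmed: Dict[str, List[float]],
                  count_weight: float = 0.5) -> List[int]:
    """Return cluster ids ordered from most to least similar to confirmed fraud numbers.

    clusters maps cluster id -> {calling number: index vector}; every cluster
    is assumed to be non-empty. confirmed maps confirmed fraud number -> index vector.
    """
    ref = intervals(list(confirmed.values()))
    scores = {}
    for cid, members in clusters.items():
        cur = intervals(list(members.values()))
        similarity = sum(overlap(a, b) for a, b in zip(cur, ref)) / len(ref)
        hits = sum(1 for number in members if number in confirmed)
        scores[cid] = similarity + count_weight * hits / len(confirmed)
    return sorted(scores, key=scores.get, reverse=True)

clusters = {
    1: {"A": [50, 40], "B": [60, 45]},
    2: {"C": [45, 30], "D": [52, 41]},
    3: {"E": [2, 2], "F": [3, 1]},
}
confirmed = {"A": [50, 40], "X": [58, 44]}
order = rank_clusters(clusters, confirmed)
fraud_cluster, suspected_cluster = order[0], order[1]
```

Under these assumptions the first entry of the ranking plays the role of the fraud number cluster and the second entry that of the suspected fraud number cluster.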

Step 3: using a logistic regression algorithm, calculate the fraud suspicion index of each calling number in the fraud number cluster or the suspected fraud number cluster, where z_ij is the i-th calling number in cluster j, j = 1 or 2, cluster 1 is the fraud number cluster and cluster 2 is the suspected fraud number cluster; Y(z_ij) is the fraud characteristic value of calling number z_ij; n is the number of characteristic indexes; α_jt is the weight coefficient of characteristic index t in cluster j; x_ijt is the value of characteristic index t of calling number z_ij; and β_j is the maximum likelihood estimate of cluster j. Then judge whether the fraud suspicion index of the calling number is greater than the fraud suspicion index threshold. If so, the calling number is a fraud call or suspected fraud call; if not, the calling number is not a fraud number or suspected fraud number and is deleted from the fraud number cluster or suspected fraud number cluster to which it belongs;

The fraud suspicion index threshold is a real number in the interval [0, 1) and can be set according to actual conditions. The larger the fraud suspicion index, the more likely the calling number is a fraud call or suspected fraud call. For example, with the threshold set to 0.9, a calling number whose fraud suspicion index is greater than or equal to 0.9 is determined to be a fraud call or suspected fraud call. The values of α_jt and β_j can be obtained as follows: extract some confirmed fraud calls and non-fraud calls from the fraud number cluster or the suspected fraud number cluster as samples and give α_jt and β_j initial values; compare the fraud suspicion index calculated for each calling number in the samples with whether that number is actually a fraud call; then adjust α_jt and β_j repeatedly until the fraud suspicion indexes calculated from the samples meet the actual needs of the system. For example, after repeated adjustment, the weight coefficient of the characteristic index "call frequency" is set to -0.6626, that of "number of called numbers" to 0.004633, that of "standard deviation of call intervals" to -0.001043, and that of "number of calls to frequently called numbers" to 0.351, and the maximum likelihood estimate is set to -6.189;
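As a hedged illustration of step 3, the sketch below assumes the standard logistic form Y(z_ij) = 1 / (1 + exp(-(β_j + Σ_t α_jt · x_ijt))), which is consistent with the symbol definitions above but is not reproduced in this text. The weights and the 0.9 threshold are the example values given above; the function names and the order of the four characteristic indexes in the input vector are assumptions.

```python
import math
from typing import Sequence

# Example values from the text: weights for call frequency, number of called
# numbers, standard deviation of call intervals, and number of calls to
# frequently called numbers, followed by the maximum likelihood estimate.
ALPHA = [-0.6626, 0.004633, -0.001043, 0.351]
BETA = -6.189
THRESHOLD = 0.9

def fraud_suspicion_index(x: Sequence[float],
                          alpha: Sequence[float],
                          beta: float) -> float:
    """Assumed logistic form: 1 / (1 + exp(-(beta + sum(alpha_t * x_t))))."""
    z = beta + sum(a * v for a, v in zip(alpha, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # evaluated this way to avoid overflow for very negative z
    return e / (1.0 + e)

def is_flagged(x: Sequence[float]) -> bool:
    """True when the index reaches the threshold, so the number stays in its cluster."""
    return fraud_suspicion_index(x, ALPHA, BETA) >= THRESHOLD
```

A calling number for which `is_flagged` returns False would, in the terms of step 3, be deleted from the cluster to which it belongs.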

Step 4: write all calling numbers in the fraud number cluster and the suspected fraud number cluster to the forensics number table and the intercepted number table, respectively. That is, the calling numbers in the fraud number cluster are written to the forensics number table, and those in the suspected fraud number cluster are written to the intercepted number table.

As shown in Fig. 2, step 1 may further comprise:

Step 11: calculate several characteristic index values of every calling number within a certain time period, and build a characteristic index set for each calling number: X_i = (x_i1, x_i2, ..., x_iN), where X_i is the characteristic index set of calling number z_i; x_i1, x_i2, ..., x_iN are the characteristic index values of calling number z_i; and N is the number of characteristic indexes;

For example, the following characteristic indexes may be chosen: call frequency, number of called numbers, standard deviation of call intervals, number of calls to frequently called numbers, peak calling period, the maximum number of calls to the same called number, the second-largest number of calls to the same called number, and the third-largest number of calls to the same called number, so that N = 8;
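One possible way to build the characteristic index sets of step 11 from call detail records is sketched below. The record layout (calling number, called number, start time in seconds) and the function name are assumptions. The index "number of calls to frequently called numbers" is left out because the text does not define when a called number counts as frequently called, so the sketch produces seven of the eight example indexes.

```python
from collections import Counter, defaultdict
from statistics import pstdev
from typing import Dict, Iterable, List, Tuple

Cdr = Tuple[str, str, int]  # (calling number, called number, start time in seconds)

def feature_vectors(cdrs: Iterable[Cdr]) -> Dict[str, List[float]]:
    """Build one characteristic index vector per calling number.

    Vector layout: call frequency, number of called numbers, standard deviation
    of call intervals, peak calling hour (0-23), then the largest, second-largest
    and third-largest number of calls to the same called number.
    """
    by_caller = defaultdict(list)
    for caller, callee, start in cdrs:
        by_caller[caller].append((start, callee))

    vectors = {}
    for caller, calls in by_caller.items():
        calls.sort()  # chronological order
        starts = [s for s, _ in calls]
        gaps = [b - a for a, b in zip(starts, starts[1:])]
        per_callee = Counter(c for _, c in calls)
        top = sorted(per_callee.values(), reverse=True)[:3]
        top += [0] * (3 - len(top))  # pad when fewer than three called numbers
        hours = Counter((s // 3600) % 24 for s in starts)
        peak_hour = max(hours, key=lambda h: (hours[h], -h))  # ties: earliest hour
        vectors[caller] = [
            float(len(calls)),
            float(len(per_callee)),
            float(pstdev(gaps)) if gaps else 0.0,
            float(peak_hour),
            float(top[0]), float(top[1]), float(top[2]),
        ]
    return vectors
```

In practice the indexes would usually be normalised to comparable scales before clustering; the text does not say whether this is done.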

Step 12: build three clusters (for example cluster 1, cluster 2 and cluster 3) and randomly assign all calling numbers to the three clusters, each calling number belonging to exactly one cluster;

Step 13: calculate the characteristic index central value set C_j of each cluster, where C_j is the characteristic index central value set of cluster j, j = 1, 2 or 3; c_jt is the central value of characteristic index t in C_j; t is a natural number between 1 and N; i is a natural number between 1 and M^j; M^j is the number of calling numbers in cluster j; and x_ijt is the value of characteristic index t of calling number z_ij in cluster j;

Step 14: calculate the sum of squared errors E of all calling numbers and judge whether E is less than or equal to the threshold of E. If so, the procedure ends. If not, calculate the distance between each calling number and the characteristic index central value set of every cluster, select the minimum distance, reassign the calling number to the cluster corresponding to that minimum distance, and return to step 13. In the distance between calling number z_i and the characteristic index central value set of cluster j, x_it is the value of characteristic index t of calling number z_i. The threshold of E is a number between 0 and 1 and can be set according to actual conditions, for example 2.71828^-5.
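Steps 12 to 14 can be sketched as the loop below. The formulas themselves are not reproduced in this text, so the sketch assumes the usual k-means forms: c_jt is the mean of x_ijt over the M^j calling numbers of cluster j, E is the sum over all calling numbers of the squared distance to the central value set of their own cluster, and the distance is Euclidean (the squared distance is used because it selects the same nearest cluster). The extra stopping rules (no assignment changes, `max_iter`) are safeguards added for the sketch and are not part of the described procedure.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

def sq_dist(x: Sequence[float], c: Sequence[float]) -> float:
    """Squared Euclidean distance between an index vector and a central value set."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def cluster_three(X: Dict[str, List[float]],
                  e_threshold: float = math.exp(-5),
                  max_iter: int = 100,
                  seed: int = 0) -> Tuple[Dict[str, int], List[List[float]]]:
    """Divide calling numbers into three clusters (ids 0, 1, 2)."""
    rng = random.Random(seed)
    n_idx = len(next(iter(X.values())))
    # Step 12: random initial assignment, each number in exactly one cluster.
    assign = {z: rng.randrange(3) for z in X}
    centers = [[0.0] * n_idx for _ in range(3)]
    for _ in range(max_iter):
        # Step 13: central value of each index = mean over the cluster members.
        for j in range(3):
            members = [X[z] for z in X if assign[z] == j]
            if members:  # an empty cluster keeps its previous central values
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 14: sum of squared errors, then reassignment to the nearest cluster.
        error = sum(sq_dist(X[z], centers[assign[z]]) for z in X)
        if error <= e_threshold:
            break
        nearest = {z: min(range(3), key=lambda j, z=z: sq_dist(X[z], centers[j]))
                   for z in X}
        if nearest == assign:  # added safeguard: nothing moved, so stop
            break
        assign = nearest
    return assign, centers
```

With real call data E will rarely fall below a threshold between 0 and 1 unless the indexes are normalised, which is why the sketch also stops once the assignment no longer changes.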

For the fraud calls and suspected fraud calls in the forensics number table and the intercepted number table, the invention can further apply recording forensics and real-time interception, respectively, to control fraud calls effectively. As shown in Fig. 3, when a user initiates a call, the invention further comprises:

Step A1: the calling-side MSC triggers the user-initiated call to the SCP, and the SCP judges whether the calling number of the call request is in the forensics number table or the intercepted number table. If so, the SCP returns a call-continue (CONTINUE) message to the calling-side MSC, carrying a forensics routing number or an interception routing number, and instructs the calling-side MSC to continue triggering the call to the anti-fraud platform; the procedure then continues with the next step. If not, the original service procedure is performed and this procedure ends;

When the calling number is in the forensics number table, the CONTINUE message carries the forensics routing number; when the calling number is in the intercepted number table, the CONTINUE message carries the interception routing number;
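The table lookup performed by the SCP in step A1 might be sketched as follows. The routing number values and the function name are placeholders invented for the example; the text does not specify them, nor the order in which the two tables are checked.

```python
from typing import Optional, Set

# Hypothetical routing numbers; real values would be provisioned in the network.
FORENSICS_ROUTE = "forensics-routing-number"
INTERCEPT_ROUTE = "interception-routing-number"

def scp_route(calling_number: str,
              forensics_table: Set[str],
              intercepted_table: Set[str]) -> Optional[str]:
    """Return the routing number to carry in the CONTINUE message.

    None means the calling number is in neither table, so the original
    service procedure applies and the call is not sent to the anti-fraud platform.
    """
    if calling_number in forensics_table:
        return FORENSICS_ROUTE
    if calling_number in intercepted_table:
        return INTERCEPT_ROUTE
    return None
```

Keeping both tables as sets gives a constant-time membership check per call, which suits a lookup made while the call is being set up.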

Step A2: on receiving the call request sent by the calling-side MSC, the anti-fraud platform judges whether the call request carries the forensics routing number. If so, it bridges the voice channel between the calling and called parties, records the calling party's voice in one direction to generate a recording file, and saves the recording file to the natural-voice sample library or the playback-voice sample library, and this procedure ends. If not, the procedure continues with the next step;

Step A3: the anti-fraud platform judges whether the call request carries the interception routing number. If so, it bridges the voice channel between the calling and called parties, records the calling party's voice in one direction, generates a recording file after S seconds of recording, and compares the recording file one by one with all fraud samples in the playback-voice sample library and the natural-voice sample library. When the recording file and a fraud sample are the same voice, the recording file is a fraud call, and the called-side MSC is instructed to interrupt the voice channel between the calling and called parties. When the recording file is not the same voice as any fraud sample, the recording file is not a fraud call and the original service procedure continues.

Because the voice channel between the calling and called parties is bridged, all speech data between them passes through the anti-fraud platform. Since the called party's voice would interfere with the calling party's voice, the invention records only the calling party's voice, in one direction. In step A2, the recording file can be screened manually by listening: if the recording is a fraud call spoken by a live person, it is saved as a fraud sample in the natural-voice sample library; if it is a fraud call played back by a machine, it is saved as a fraud sample in the playback-voice sample library. As fraud samples accumulate, the natural-voice and playback-voice sample libraries grow richer and fraud calls are recognized with increasing accuracy. In step A3, the value of S can be set according to actual needs, so that suspected fraud calls are identified and intercepted in real time while the call is in progress.

In step A3 of Fig. 3, comparing the recording file one by one with all fraud samples in the playback-voice sample library and the natural-voice sample library may further comprise: first comparing the recording file one by one with the fraud samples in the playback-voice sample library; when the recording file is not the same voice as any fraud sample in the playback-voice sample library, comparing it one by one with the fraud samples in the natural-voice sample library.
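The decision flow of steps A2 and A3, including the order in which the two sample libraries are consulted, can be sketched as below. The comparison functions are passed in as parameters because the sketch does not implement them; the route labels, the action names and all function names are hypothetical.

```python
from typing import Callable, Iterable

Same = Callable[[str, str], bool]  # (recording, fraud sample) -> same voice?

def matches_fraud_sample(recording: str,
                         playback_samples: Iterable[str],
                         natural_samples: Iterable[str],
                         same_playback: Same,
                         same_speaker: Same) -> bool:
    """Compare with the playback-voice library first, then the natural-voice library."""
    if any(same_playback(recording, s) for s in playback_samples):
        return True
    return any(same_speaker(recording, s) for s in natural_samples)

def platform_action(route: str,
                    recording: str,
                    playback_samples: Iterable[str],
                    natural_samples: Iterable[str],
                    same_playback: Same,
                    same_speaker: Same) -> str:
    """Map a routing label to the action taken by the anti-fraud platform."""
    if route == "forensics":   # step A2: record and keep as a sample
        return "record_and_store"
    if route == "intercept":   # step A3: compare, then interrupt or let through
        hit = matches_fraud_sample(recording, playback_samples, natural_samples,
                                   same_playback, same_speaker)
        return "interrupt_call" if hit else "continue_call"
    return "continue_call"
```

Because `any` stops at the first match, the natural-voice library is only searched when no playback sample matches, mirroring the order described above.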

As shown in Fig. 4, comparing the recording file one by one with the fraud samples in the playback-voice sample library may further comprise:

Step A31: build a time characteristic value set for the recording file. Starting from the voice start point of the recording file, with n seconds as one frame, extract G groups of W frames of voice information from the recording file in order; using voice endpoint detection, calculate the number of frames between the effective voice start point and end point within each group of W frames, and take that frame count as the time characteristic value of the group; then save the G time characteristic values to the time characteristic value set of the recording file in the order in which they occur in the recording file;

Voice start and end points can be detected with the dual-threshold decision method based on short-time energy and zero-crossing rate, which removes the interference of silent gaps in the call. The values of n, G and W can be set according to actual needs, for example n = 10 ms, G = 100 and W = 5. Repeated tests show that the invention performs well when the minimum voice length is set to 10 s or more, that is, G >= 100 and W = 5;
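A simplified sketch of step A31 is given below. It assumes the recording is available as a list of numeric samples, takes short-time energy as the sum of squared samples in a frame and the zero-crossing rate as the number of sign changes in a frame, and applies a basic dual-threshold rule: frames above a high energy threshold anchor the speech segment, which is then extended over neighbouring frames that pass a low energy threshold or a zero-crossing threshold. The thresholds and all names are assumptions.

```python
from typing import List, Sequence

def short_time_energy(frame: Sequence[float]) -> float:
    return sum(s * s for s in frame)

def zero_crossing_rate(frame: Sequence[float]) -> int:
    """Number of sign changes between consecutive samples in the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def speech_span(frames: Sequence[Sequence[float]],
                e_high: float, e_low: float, zcr_min: int) -> int:
    """Frames between the effective voice start and end points (0 if no speech)."""
    energies = [short_time_energy(f) for f in frames]
    core = [i for i, e in enumerate(energies) if e >= e_high]
    if not core:
        return 0
    start, end = core[0], core[-1]

    def soft(i: int) -> bool:  # weaker test used to extend the segment
        return energies[i] >= e_low or zero_crossing_rate(frames[i]) >= zcr_min

    while start > 0 and soft(start - 1):
        start -= 1
    while end < len(frames) - 1 and soft(end + 1):
        end += 1
    return end - start + 1

def time_feature_set(samples: Sequence[float], frame_len: int, G: int, W: int,
                     e_high: float, e_low: float, zcr_min: int) -> List[int]:
    """One time characteristic value per group of W frames, in recording order."""
    values = []
    for g in range(G):
        group = [samples[(g * W + w) * frame_len:(g * W + w + 1) * frame_len]
                 for w in range(W)]
        group = [f for f in group if len(f) == frame_len]  # drop incomplete frames
        values.append(speech_span(group, e_high, e_low, zcr_min))
    return values
```

Here `frame_len` is the number of samples in one frame of n seconds; for example, n = 10 ms at an assumed sampling rate of 8 kHz would give 80 samples per frame.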

Step A32: build an energy characteristic value set for the recording file. Starting from the voice start point of the recording file, with n seconds as one frame, extract G*W frames of voice information in order from the recording file or fraud sample, calculate the short-time energy of each frame and take it as the energy characteristic value of that frame; then save the G*W energy characteristic values to the energy characteristic value set of the recording file in the order in which they occur in the recording file;
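A minimal sketch of step A32 follows, assuming the recording is a list of numeric samples that already starts at the voice start point and that short-time energy is the sum of squared samples in a frame. The function name and the `frame_len` parameter (samples per frame) are assumptions.

```python
from typing import List, Sequence

def energy_feature_set(samples: Sequence[float],
                       frame_len: int, G: int, W: int) -> List[float]:
    """Short-time energy of each of the first G*W frames, in recording order.

    Frames beyond the end of the recording contribute an energy of 0.
    """
    return [
        float(sum(s * s for s in samples[k * frame_len:(k + 1) * frame_len]))
        for k in range(G * W)
    ]
```

The same function can be applied to a fraud sample, since the text builds the sets of samples and recordings in the same way.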

Step A33: read the time characteristic value set and the energy characteristic value set of one fraud sample from the playback-voice sample library;

The time characteristic value set and energy characteristic value set of each fraud sample in the playback-voice sample library are built in the same way as those of the recording file, and the construction is not repeated here;

Step A34: compare, one by one, the time characteristic values at the same position in the time characteristic value sets of the recording file and the fraud sample, and so calculate the number TS of identical time characteristic values in the two sets;

Step A35: extract the first K energy characteristic values from the energy characteristic value sets of the recording file and the fraud sample, respectively; the value of K can be set according to actual needs, for example K = 5;

Step A36: calculate the energy multiplication factor B between the fraud sample and the recording file, where YE_b is the b-th energy characteristic value in the energy characteristic value set of the fraud sample and GE_b is the b-th energy characteristic value in the energy characteristic value set of the recording file;

Step A37: according to the energy multiplication factor B, adjust each energy characteristic value in the energy characteristic value set of the recording file: GE_b = B × GE_b, where b is a natural number between 1 and G*W;

Step A38: compare, one by one, the energy characteristic values at the same position in the energy characteristic value sets of the recording file and the fraud sample, and so calculate the number ES of identical energy characteristic values in the two sets;

Step A39: calculate the fraud voice confidence of the recording file and the fraud sample, where F is the weight coefficient of the confidence, and judge whether the fraud voice confidence is greater than the fraud voice confidence threshold CC. If so, the recording file and the fraud sample are the same voice, that is, the incoming call corresponding to the recording file can be judged a fraud call, and this procedure ends. If not, the recording file and the fraud sample are not the same voice; the time characteristic value set and energy characteristic value set of the next fraud sample are read from the playback-voice sample library and the procedure returns to step A34. The values of F and CC can be set according to actual conditions, for example F = 0.5 and CC = 90%.
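Steps A34 to A39 might be sketched as follows. The formulas for B and for the confidence are not reproduced in this text, so the sketch assumes B = (sum of the first K values YE_b) / (sum of the first K values GE_b) and confidence = F × TS / G + (1 - F) × ES / (G*W). It also treats two energy values as identical when they differ by no more than a relative tolerance, since exact equality of scaled energies is unlikely in practice. These choices, and all function names, are assumptions; the feature sets are assumed to be non-empty.

```python
from typing import Sequence

def energy_factor(sample_energy: Sequence[float],
                  rec_energy: Sequence[float], K: int = 5) -> float:
    """Assumed form of B: ratio of the summed first K energy values."""
    den = sum(rec_energy[:K])
    return sum(sample_energy[:K]) / den if den else 1.0

def fraud_voice_confidence(rec_time: Sequence[int], rec_energy: Sequence[float],
                           smp_time: Sequence[int], smp_energy: Sequence[float],
                           F: float = 0.5, K: int = 5,
                           energy_tol: float = 0.05) -> float:
    """Assumed confidence: F * TS / G + (1 - F) * ES / (G * W)."""
    # Step A34: count identical time characteristic values at the same position.
    ts = sum(1 for r, s in zip(rec_time, smp_time) if r == s)
    # Steps A35 to A37: scale the recording's energies by the factor B.
    b = energy_factor(smp_energy, rec_energy, K)
    scaled = [b * e for e in rec_energy]
    # Step A38: count energy values that agree within the relative tolerance.
    es = sum(1 for r, s in zip(scaled, smp_energy)
             if abs(r - s) <= energy_tol * max(abs(s), 1e-12))
    # Step A39: weighted combination of the two match ratios.
    return F * ts / len(smp_time) + (1 - F) * es / len(smp_energy)

def is_same_voice(confidence: float, cc: float = 0.9) -> bool:
    return confidence > cc
```

Scaling by B makes the comparison insensitive to overall loudness: a recording of the same playback at half the amplitude has one quarter of the energy in every frame, and B restores the sample's scale before the values are compared.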

The comparison between the recording file and the fraud samples in the natural-voice sample library can be implemented with text-independent speaker recognition technology (hereinafter speaker recognition). Speaker recognition is essentially a pattern matching problem. The basic principle is to extract features from the voice of the target speaker to be identified and train a model, match the resulting model features against the model features in the natural-voice sample library, and then judge from the matching similarity which speaker in the natural-voice sample library the voice most likely belongs to. Commonly used feature extraction methods at present include linear prediction cepstrum coefficients (LPCC), based on linear predictive coding (LPC), and Mel-scale frequency cepstral coefficients (MFCC), based on the principles of speech and acoustics. Common pattern matching methods include template matching methods based on dynamic time warping (DTW), vector quantization (VQ), hidden Markov models (HMM) and Gaussian mixture models (GMM).

The quantization and recognition steps differ with the feature extraction and pattern matching methods used and are not described in detail here. Data indicate that speaker recognition based on a GMM with 32 Gaussian mixture components, given sufficient training data, can reach an accuracy of 98%.
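Of the pattern matching methods named above, dynamic time warping is the simplest to illustrate. The sketch below is a generic DTW distance over sequences of scalar features and is not taken from the text; a practical speaker recognition system would compare sequences of feature vectors such as MFCCs, and the function names and the template dictionary are hypothetical.

```python
from typing import Dict, Sequence, Tuple

def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def closest_template(query: Sequence[float],
                     templates: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the name of the template nearest to the query and its DTW distance."""
    name = min(templates, key=lambda k: dtw_distance(query, templates[k]))
    return name, dtw_distance(query, templates[name])
```

DTW tolerates differences in speaking rate because one frame of the query may be aligned with several consecutive frames of the template.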

As shown in Fig. 5, the system for real-time interception of fraud calls based on a clustering algorithm according to the present invention comprises an anti-fraud platform, a service control point (SCP) and a mobile switching center (MSC), wherein:

The calling-side MSC is configured, on receiving a user-initiated call, to trigger the call to the SCP and then, as instructed by the SCP, to continue triggering the call to the anti-fraud platform;

The SCP is configured, on receiving a user call request forwarded by the calling-side MSC, to judge whether the calling number of the call request is in the forensics number table or the intercepted number table. If so, it returns a call-continue (CONTINUE) message to the calling-side MSC, carrying a forensics routing number or an interception routing number, and instructs the calling-side MSC to continue triggering the call to the anti-fraud platform; if not, the original service procedure is performed. When the calling number is in the forensics number table, the CONTINUE message carries the forensics routing number; when the calling number is in the intercepted number table, the CONTINUE message carries the interception routing number;

The anti-fraud platform may further comprise:

A logistic regression device, configured to use a logistic regression algorithm to calculate the fraud suspicion index of each calling number in the fraud number cluster or the suspected fraud number cluster, where z_ij is the i-th calling number in cluster j, j = 1 or 2, cluster 1 is the fraud number cluster and cluster 2 is the suspected fraud number cluster; Y(z_ij) is the fraud characteristic value of calling number z_ij; n is the number of characteristic indexes; α_jt is the weight coefficient of characteristic index t in cluster j; x_ijt is the value of characteristic index t of calling number z_ij; and β_j is the maximum likelihood estimate of cluster j; and then to judge whether the fraud suspicion index of the calling number is greater than the fraud suspicion index threshold. If so, the calling number is a fraud call or suspected fraud call; if not, the calling number is not a fraud number or suspected fraud number and is deleted from the fraud number cluster or suspected fraud number cluster to which it belongs;

A number table updating device, configured to write all calling numbers in the fraud number cluster and the suspected fraud number cluster to the forensics number table and the intercepted number table, respectively;

A call forwarding device, configured, on receiving a call request sent by the calling-side MSC, to judge whether the call request carries the forensics routing number or the interception routing number; if it carries the forensics routing number, the recording forensics device is notified, and if it carries the interception routing number, the fraud interception device is notified;

A recording forensics device, configured to bridge the voice channel between the calling and called parties in the call request, record the calling party's voice in one direction to generate a recording file, and save the recording file to the natural-voice sample library or the playback-voice sample library;

A fraud interception device, configured to bridge the voice channel between the calling and called parties in the call request, record the calling party's voice in one direction, generate a recording file after S seconds of recording, and compare the recording file one by one with all fraud samples in the playback-voice sample library and the natural-voice sample library; when the recording file and a fraud sample are the same voice, the recording file is a fraud call, and the called-side MSC is instructed to interrupt the voice channel between the calling and called parties.

As shown in Figure 6, the cluster analysis device may further include:

A characteristic index construction unit, configured to calculate several characteristic index values of all calling numbers within a certain time period and to build a corresponding characteristic index set for each calling number: X_i = (x_i1, x_i2, ..., x_iN), where X_i is the characteristic index set of calling number z_i, x_i1, x_i2, ..., x_iN are the characteristic index values of calling number z_i, and N is the number of characteristic indexes;

A cluster initialization unit, configured to build three clusters, namely cluster 1, cluster 2 and cluster 3, and to divide all calling numbers randomly among the three clusters, each calling number belonging to exactly one cluster;

A cluster center calculation unit, configured to calculate the characteristic index central value set C_j of each cluster: C_j = (c_j1, c_j2, ..., c_jN), with c_jt = (1/M^j) · Σ_{i=1}^{M^j} x_ijt, where C_j is the characteristic index central value set of cluster j, j = 1, 2 or 3, c_jt is the central value of characteristic index t in C_j, t is a natural number between 1 and N, i is a natural number between 1 and M^j, M^j is the number of calling numbers in cluster j, and x_ijt is the value of characteristic index t for calling number z_ij in cluster j; the unit then notifies the cluster adjustment unit to calculate the sum of squared errors of all calling numbers;

A cluster adjustment unit, configured to calculate the sum of squared errors E of all calling numbers and to judge whether E is less than or equal to the threshold of E. If not, the unit calculates the distance between each calling number and the characteristic index central value set of every cluster, selects the minimum distance, and reassigns the calling number to the cluster corresponding to that minimum distance, where the distance between calling number z_i and the characteristic index central value set of cluster j is computed from x_it, the value of characteristic index t for calling number z_i, and the central values of cluster j. Finally, the unit notifies the cluster center calculation unit to recalculate the characteristic index central value set of each cluster. The threshold of E is a number between 0 and 1 whose value can be set according to actual conditions, for example 2.71828^-5 (about 0.0067).
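The interplay of the four units above can be sketched as the following Python loop. The formulas for E and for the distance are not reproduced in the source, so the usual k-means forms are assumed: E is the sum of squared differences between each characteristic index value and the central value of its own cluster, and the distance is Euclidean. The iteration cap and the stop when no number changes cluster are additions of this sketch, not part of the source, and the function name is hypothetical.

```python
import math
import random


def cluster_numbers(features, k=3, threshold=math.e ** -5, max_rounds=100, seed=0):
    """features: dict calling number -> list of N characteristic index values.

    Returns a dict calling number -> cluster index in range(k).
    """
    rng = random.Random(seed)
    numbers = list(features)
    n = len(next(iter(features.values())))
    # Cluster initialization unit: random division, one cluster per number.
    assign = {z: rng.randrange(k) for z in numbers}
    for _ in range(max_rounds):  # cap is an assumption of this sketch
        # Cluster center calculation unit: per-index mean of each cluster.
        centers = []
        for j in range(k):
            members = [features[z] for z in numbers if assign[z] == j]
            if members:
                centers.append(
                    [sum(m[t] for m in members) / len(members) for t in range(n)]
                )
            else:
                centers.append(None)  # empty cluster has no center
        # Cluster adjustment unit: sum of squared errors (assumed form).
        error = sum(
            (features[z][t] - centers[assign[z]][t]) ** 2
            for z in numbers
            for t in range(n)
        )
        if error <= threshold:
            break
        # Reassign every number to the nearest center (Euclidean, assumed).
        new_assign = {}
        for z in numbers:
            dists = {
                j: math.dist(features[z], c)
                for j, c in enumerate(centers)
                if c is not None
            }
            new_assign[z] = min(dists, key=dists.get)
        if new_assign == assign:
            break  # extra stop condition, not in the source
        assign = new_assign
    return assign
```

Because the initial division is random, the loop may settle in a local optimum; the source leaves the handling of that case open.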

As shown in Figure 7, the fraud interception device may further include:

A voice recording unit, configured to receive the call request sent by the calling party, bridge the voice channel between the calling and called parties, and, once that voice channel is established, record the calling party's voice one-way and generate a recording file after S seconds of recording;

A repeated-voice recognition unit, configured to compare the recording file one by one with the fraud samples in the repeated-voice sample library, in order to identify whether the recording file and a fraud sample in the repeated-voice sample library are the same voice;

A natural-voice recognition unit, configured to compare the recording file one by one with the fraud samples in the natural-voice sample library, in order to identify whether the recording file and a fraud sample in the natural-voice sample library are the same voice.
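The dispatch on the routing number and the subsequent one-by-one comparison can be outlined as below. This is only a sketch: the routing number values, the return labels, the target_lib parameter and the same_voice callback are hypothetical stand-ins, since the source does not specify which sample library a forensics recording is saved into or how the platform signals the MSC.

```python
FORENSICS_ROUTE = "FORENSICS"   # hypothetical routing number values
INTERCEPT_ROUTE = "INTERCEPT"


def handle_call(routing_number, recording, repeated_lib, natural_lib,
                same_voice, target_lib=None):
    """Return the action the anti-fraud platform would take for one call."""
    if routing_number == FORENSICS_ROUTE:
        # Recording forensics device: keep the recording as a new sample.
        library = target_lib if target_lib is not None else natural_lib
        library.append(recording)
        return "recorded"
    if routing_number == INTERCEPT_ROUTE:
        # Fraud interception device: compare with every stored fraud sample.
        for sample in repeated_lib + natural_lib:
            if same_voice(recording, sample):
                return "interrupt"  # instruct the called MSC to cut the channel
        return "continue"
    return "pass"  # call request carries neither routing number
```

In practice same_voice would be the repeated-voice or natural-voice recognition applied to S seconds of recorded audio.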

As shown in Figure 8, the repeated-voice recognition unit may further include:

A temporal feature construction component, configured to build a temporal characteristic value set for the recording file and for each fraud sample in the repeated-voice sample library: starting from the voice start point of the recording file or fraud sample, with each frame n seconds long, G groups of W frames of voice information are extracted in sequence from the recording file or fraud sample; using voice endpoint detection, the number of frames between the start point and the end point of valid speech within each group of W frames is calculated and recorded as the temporal characteristic value of that group; the G temporal characteristic values are then saved, in their order of occurrence in the recording file or fraud sample, into the temporal characteristic value set of the recording file or fraud sample. A double-threshold decision method based on short-time energy and zero-crossing rate may be used to detect the voice start and end points, so as to eliminate interference from silent portions of the call;

An energy feature construction component, configured to build an energy characteristic value set for the recording file and for each fraud sample in the repeated-voice sample library: starting from the voice start point of the recording file or fraud sample, with each frame n seconds long, G*W frames of voice information are extracted in sequence from the recording file or fraud sample; the short-time energy of each frame is calculated and recorded as the energy characteristic value of that frame; the G*W energy characteristic values are then saved, in their order of occurrence in the recording file or fraud sample, into the energy characteristic value set of the recording file or fraud sample;

A fraud confidence calculation component, configured to read the temporal characteristic value set and energy characteristic value set of each fraud sample from the repeated-voice sample library, send the temporal characteristic value sets of the recording file and the fraud sample to the temporal feature recognition component, and at the same time send the energy characteristic value sets of the recording file and the fraud sample to the energy feature recognition component; the component then calculates the fraud voice confidence of the recording file and the fraud sample, where F is the weight coefficient of the confidence, and judges whether that confidence is greater than the threshold CC. If so, the recording file and the fraud sample are the same voice; if not, they are not the same voice;

A temporal feature recognition component, configured to compare, position by position, the temporal characteristic values at the same ordinal position in the temporal characteristic value sets of the recording file and the fraud sample, and thereby to count the number TS of identical temporal characteristic values in the two sets;

An energy feature recognition component, configured to extract the first K energy characteristic values from the energy characteristic value sets of the recording file and the fraud sample respectively, and then to calculate the energy amplification factor B of the fraud sample relative to the recording file, where YE_b is the b-th energy characteristic value in the energy characteristic value set of the fraud sample and GE_b is the b-th energy characteristic value in the energy characteristic value set of the recording file. Each energy characteristic value of the recording file is then adjusted according to the energy amplification factor B: GE_b = B × GE_b, where b is a natural number between 1 and G*W. Finally, the energy characteristic values at the same ordinal position in the energy characteristic value sets of the recording file and the fraud sample are compared position by position, and the number ES of identical energy characteristic values in the two sets is counted.
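The feature construction and comparison performed by the components above can be sketched in Python as follows. Several parts are assumptions rather than the source's method: the formulas for B and for the confidence are not reproduced in the source, so B is taken as the ratio of the summed first K energy values and the confidence as an F-weighted mix of the matching ratios TS/G and ES/(G*W); endpoint detection is simplified to a single energy gate instead of the double-threshold method with zero-crossing rate; and a small relative tolerance stands in for "identical" floating-point values. All function names are hypothetical.

```python
import math


def energy_features(signal, frame_len, count):
    """Short-time energy (sum of squared samples) of the first `count` frames."""
    return [
        sum(s * s for s in signal[i * frame_len:(i + 1) * frame_len])
        for i in range(count)
    ]


def temporal_features(energies, w, gate):
    """Per group of w frames: frames spanned from first to last active frame."""
    feats = []
    for g in range(len(energies) // w):
        group = energies[g * w:(g + 1) * w]
        active = [i for i, e in enumerate(group) if e > gate]
        feats.append(active[-1] - active[0] + 1 if active else 0)
    return feats


def energy_factor(sample_e, recording_e, k):
    """Assumed form of B: ratio of the summed first k energy values."""
    return sum(sample_e[:k]) / sum(recording_e[:k])


def count_equal(a, b, rel_tol=1e-6):
    """Number of positions at which the two value sets agree (TS or ES)."""
    return sum(1 for x, y in zip(a, b) if math.isclose(x, y, rel_tol=rel_tol))


def fraud_confidence(ts, es, g, w, f):
    """Assumed combination: f weights the temporal ratio, 1 - f the energy ratio."""
    return f * ts / g + (1 - f) * es / (g * w)
```

Scaling the recording's energies by B before comparison compensates for a recording that is uniformly louder or quieter than the stored fraud sample.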

The natural-voice recognition unit can compare the recording file with the fraud samples in the natural-voice sample library by means of text-independent speaker recognition (hereinafter, speaker recognition). Speaker recognition is essentially a pattern matching problem. The basic principle is to perform feature extraction and model training on the voice of the target speaker to be identified, match the resulting model features against the model features in the natural-voice sample library, and then decide, according to the matching similarity, which speaker in the natural-voice sample library the voice most likely belongs to. Commonly used feature extraction methods include linear prediction cepstral coefficients (Linear Predictive Cepstrum Coefficients, LPCC), which are based on linear predictive coding (Linear Predictive Coding, LPC), and Mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC), which are based on speech and acoustic principles. Common pattern matching methods include template matching based on dynamic time warping (Dynamic Time Warping, DTW), vector quantization (Vector Quantization, VQ), hidden Markov models (Hidden Markov Model, HMM) and Gaussian mixture models (Gaussian Mixture Model, GMM).
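As an illustration of one of the template matching methods named above, the following minimal Python sketch computes a dynamic time warping (DTW) distance between two one-dimensional feature sequences. It is a textbook formulation, not the implementation used by the natural-voice recognition unit; real systems would apply it to vectors such as MFCC frames, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW: minimal accumulated |a_i - b_j| over monotonic alignments."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]
```

A smaller distance indicates a closer match, so two utterances of the same content spoken at different speeds can still align with a distance near zero.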

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for intercepting fraud calls in real time based on a clustering algorithm, characterized in that it includes:

2. The method according to claim 1, characterized in that, between step 2 and step 3, it further includes:

Using a logistic regression algorithm to calculate the fraud suspicion index of each calling number in the fraud number cluster and in the suspected fraud number cluster, where z_ij is the i-th calling number in cluster j, j = 1 or 2, cluster 1 is the fraud number cluster, cluster 2 is the suspected fraud number cluster, Y(z_ij) is the fraud characteristic value of calling number z_ij, n is the number of characteristic indexes, α_jt is the weight coefficient of characteristic index t in cluster j, x_ijt is the value of characteristic index t for calling number z_ij, and β_j is the maximum likelihood estimate for cluster j; then judging whether the fraud suspicion index of the calling number is greater than the fraud suspicion threshold, and if not, deleting the calling number from the fraud number cluster or suspected fraud number cluster to which it belongs, the fraud suspicion threshold being a real number in the interval [0, 1).

3. The method according to claim 1, characterized in that step 1 further includes:

Step 12: building three clusters, namely cluster 1, cluster 2 and cluster 3, and dividing all calling numbers randomly among the three clusters, each calling number belonging to exactly one cluster;

Step 14: calculating the sum of squared errors E of all calling numbers and judging whether E is less than or equal to the threshold of E; if so, this flow ends; if not, calculating the distance between each calling number and the characteristic index central value set of every cluster, selecting the minimum distance, and reassigning the calling number to the cluster corresponding to that minimum distance, where the distance between calling number z_i and the characteristic index central value set of cluster j is computed from x_it, the value of characteristic index t for calling number z_i, and the central values of cluster j; then returning to step 13, the threshold of E being a number between 0 and 1.

4. The method according to claim 1, characterized in that, when a user initiates a call, it includes:

Step A1: the call initiated by the user is triggered to the service control point (SCP) by the calling mobile switching center (MSC); the SCP judges whether the calling number of the call request is in the forensics number table or the intercepted number table; if so, the SCP returns a call proceeding message to the calling MSC, the call proceeding message carrying a forensics routing number or an interception routing number, and instructs the calling MSC to continue routing the call to the anti-fraud platform, where the call proceeding message carries the forensics routing number when the calling number is in the forensics number table, and carries the interception routing number when the calling number is in the intercepted number table.

5. The method according to claim 4, characterized in that it further includes:

Step A2: when the anti-fraud platform receives the call request sent by the calling MSC, it judges whether the call request carries a forensics routing number; if so, it bridges the voice channel between the calling and called parties in the call request, records the calling party's voice one-way to generate a recording file, and saves the recording file into the natural-voice sample library or the repeated-voice sample library, and this flow ends; if not, it continues with the next step;

Step A3: the anti-fraud platform judges whether the call request carries an interception routing number; if so, it bridges the voice channel between the calling and called parties in the call request, records the calling party's voice one-way and generates a recording file after S seconds of recording, and then compares the recording file one by one with all fraud samples in the repeated-voice sample library and the natural-voice sample library; when the recording file and a fraud sample are the same voice, the call is identified as a fraud call, and the called MSC is instructed to interrupt the voice channel between the calling and called parties.

6. The method according to claim 5, characterized in that, in step A3, comparing the recording file one by one with the fraud samples in the repeated-voice sample library further includes:

Step A35: extracting the first K energy characteristic values from the energy characteristic value sets of the recording file and the fraud sample, respectively;

Step A39: calculating the fraud voice confidence of the recording file and the fraud sample, where F is the weight coefficient of the confidence, and judging whether the fraud voice confidence of the recording file and the fraud sample is greater than the fraud voice confidence threshold CC; if so, the recording file and the fraud sample are the same voice, and this flow ends; if not, the recording file and the fraud sample are not the same voice, the temporal characteristic value set and energy characteristic value set of the next fraud sample are read from the repeated-voice sample library, and the flow returns to step A34.

7. A system for intercepting fraud calls in real time based on a clustering algorithm, characterized in that it includes an anti-fraud platform, the anti-fraud platform further including:

8. The system according to claim 7, characterized in that the anti-fraud platform further includes:

A logistic regression device, configured to use a logistic regression algorithm to calculate the fraud suspicion index of each calling number in the fraud number cluster and in the suspected fraud number cluster, where z_ij is the i-th calling number in cluster j, j = 1 or 2, cluster 1 is the fraud number cluster, cluster 2 is the suspected fraud number cluster, Y(z_ij) is the fraud characteristic value of calling number z_ij, n is the number of characteristic indexes, α_jt is the weight coefficient of characteristic index t in cluster j, x_ijt is the value of characteristic index t for calling number z_ij, and β_j is the maximum likelihood estimate for cluster j; the device then judges whether the fraud suspicion index of the calling number is greater than the fraud suspicion threshold, and if not, deletes the calling number from the fraud number cluster or suspected fraud number cluster to which it belongs.

9. The system according to claim 7, characterized in that the cluster analysis device further includes:

A cluster adjustment unit, configured to calculate the sum of squared errors E of all calling numbers and to judge whether E is less than or equal to the threshold of E; if not, the unit calculates the distance between each calling number and the characteristic index central value set of every cluster, selects the minimum distance, and reassigns the calling number to the cluster corresponding to that minimum distance, where the distance between calling number z_i and the characteristic index central value set of cluster j is computed from x_it, the value of characteristic index t for calling number z_i, and the central values of cluster j; finally, the unit notifies the cluster center calculation unit to recalculate the characteristic index central value set of each cluster, the threshold of E being a number between 0 and 1.

10. The system according to claim 7, characterized in that it further includes:

A service control point (SCP), configured to judge, on receiving a user's call request forwarded by the calling mobile switching center (MSC), whether the calling number of the call request is in the forensics number table or the intercepted number table; if so, the SCP returns a call proceeding message to the calling MSC, the call proceeding message carrying a forensics routing number or an interception routing number, and instructs the calling MSC to continue routing the call to the anti-fraud platform, where the call proceeding message carries the forensics routing number when the calling number is in the forensics number table, and carries the interception routing number when the calling number is in the intercepted number table.

11. The system according to claim 10, characterized in that the anti-fraud platform further includes:

A fraud interception device, configured to bridge the voice channel between the calling and called parties in the call request, then record the calling party's voice one-way and generate a recording file after S seconds of recording, and then compare the recording file one by one with all fraud samples in the repeated-voice sample library and the natural-voice sample library; when the recording file and a fraud sample are the same voice, the called MSC is instructed to interrupt the voice channel between the calling and called parties.

12. The system according to claim 11, characterized in that the fraud interception device further includes a repeated-voice recognition unit, the repeated-voice recognition unit further including:

A temporal feature construction component, configured to build a temporal characteristic value set for the recording file and for each fraud sample in the repeated-voice sample library: starting from the voice start point of the recording file or fraud sample, with each frame n seconds long, G groups of W frames of voice information are extracted in sequence from the recording file or fraud sample; using voice endpoint detection, the number of frames between the start point and the end point of valid speech within each group of W frames is calculated and recorded as the temporal characteristic value of that group; the G temporal characteristic values are then saved, in their order of occurrence in the recording file or fraud sample, into the temporal characteristic value set of the recording file or fraud sample;