CN104410973B

CN104410973B - Method and system for identifying playback-type fraudulent calls

Info

Publication number: CN104410973B
Application number: CN201410668881.4A
Authority: CN
Inventors: 廖建新; 林大庆
Original assignee: BEIJING XINXUN CENTURY INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING XINXUN CENTURY INFORMATION TECHNOLOGY Co Ltd
Priority date: 2014-11-20
Filing date: 2014-11-20
Publication date: 2017-11-28
Anticipated expiration: 2034-11-20
Also published as: CN104410973A

Abstract

A method and system for identifying playback-type fraudulent calls, comprising: after the voice channel between the calling and called parties is established, recording the caller's voice one-way to generate a recording file; building a time feature value set for the recording file by detecting the voice start point of the recording file, then, starting from that point, sequentially extracting multiple groups of several frames of voice information and saving, for each group, the number of frames between the valid-speech start point and end point into the time feature value set; reading the time feature value set of each fraud sample one by one from a fraud sample library, comparing the time feature values at the same ordinal positions in the sets of the recording file and the fraud sample, thereby counting how many time feature values are identical, and using that count to judge whether the recording file and the fraud sample are the same voice. The invention belongs to the field of network communication technology and can identify playback-type fraudulent calls in real time.

Description

Method and system for identifying playback-type fraudulent calls

Technical field

The present invention relates to a method and system for identifying playback-type fraudulent calls, and belongs to the field of network communication technology.

Background technology

With the spread of mobile phones, telephone fraud has become rampant. Although the relevant government departments have issued warnings to the public and the news media have reported on it repeatedly, a large number of users are still deceived every day, and the resulting economic losses are rising year by year.

By voice source, fraudulent calls fall into two classes. The first is the playback type: after the called party answers, a pre-recorded fraud message is played. Calls of this class have a single, fixed audio source and, because the calling can be automated, they are very numerous. The second is the manual type: after the call is connected, a real person carries out the fraud. Calls of this class have a single speaker with a fixed voiceprint but, because of staffing limits, they are relatively few. For this reason, identifying first-class (playback-type) fraudulent calls in real time has become a pressing technical problem, and some related solutions have been proposed. For example, patent application CN 201210182167.5 (title: System, method, mobile terminal and cloud analysis server for preventing telephone fraud; applicant: Baidu Online Network Technology (Beijing) Co., Ltd.; filing date: 2012-06-04) proposes a system for preventing telephone fraud that includes a mobile terminal and a cloud analysis server. The mobile terminal obtains the voiceprint features and conversation content of the other party during a call and sends them to the cloud analysis server. The cloud analysis server compares the voiceprint features and conversation content against a pre-stored fraudster voiceprint database and a fraud type database, respectively, to determine whether the other party is committing fraud, and sends an alarm to the user of the mobile terminal when fraud is confirmed. Because this scheme compares voiceprint features and conversation content against the pre-stored fraudster voiceprint database and fraud type database, it is relatively complex to implement and computationally expensive, and it struggles to meet the stricter real-time requirements of practical use. It is therefore unsuitable for identifying and interrupting fraudulent calls in real time while a call is in progress.

How to identify playback-type fraudulent calls in real time is therefore a technical problem that merits further study.

Summary of the invention

In view of this, the object of the present invention is to provide a method and system for identifying playback-type fraudulent calls that can recognize such calls in real time.

To achieve the above object, the present invention provides a method for identifying playback-type fraudulent calls, comprising:

Step 1: after the voice channel between the calling and called parties is established, record the caller's voice one-way, and generate a new recording file after S seconds of recording;

Step 2: build a time feature value set for the newly generated recording file: detect the voice start point of the recording file; then, starting from the voice start point, sequentially extract multiple groups of several frames of voice information from the recording file, and save, in order, the number of frames between the valid-speech start point and end point within each group into the time feature value set of the recording file;

Step 3: read the time feature value set of each fraud sample one by one from the fraud sample library, and compare, one by one, the time feature values at the same ordinal positions in the respective sets of the newly generated recording file and the fraud sample, thereby calculating the number TS of identical time feature values, and use TS to judge whether the recording file and the fraud sample are the same voice.

To achieve the above object, the present invention also provides a system for identifying playback-type fraudulent calls, comprising a fraud identification platform, which further comprises:

A voice bridging device, for receiving the call request sent by the calling user and then bridging the voice channel between the calling and called parties;

A voice recording device, for recording the caller's voice one-way after the voice channel between the calling and called parties is established, and generating a new recording file after S seconds of recording;

A time feature value set construction device, for building the respective time feature value set for the newly generated recording file or for each fraud sample in the fraud sample library: detect the voice start point of the recording file or fraud sample; then, starting from the voice start point, sequentially extract multiple groups of several frames of voice information from the recording file or fraud sample, and save, in order, the number of frames between the valid-speech start point and end point within each group into the time feature value set of the recording file or fraud sample;

A fraud voice recognition device, for reading the time feature value set of each fraud sample one by one from the fraud sample library, and comparing, one by one, the time feature values at the same ordinal positions in the respective sets of the newly generated recording file and the fraud sample, thereby calculating the number TS of identical time feature values, and using TS to judge whether the recording file and the fraud sample are the same voice.

Compared with the prior art, the beneficial effects of the present invention are as follows. The invention captures S seconds of voice information from the caller's voice, computes time feature values from the number of frames between the valid-speech start point and end point in that voice information, and then compares the time feature values of the recording file with those of each fraud sample in the fraud sample library to determine whether the incoming call is fraudulent. The technical scheme is simple, fast and effective, and offers good real-time performance in practice. On top of the time dimension, the invention can further perform a two-dimensional feature value analysis of time and energy on the caller's voice, so as to distinguish different voices effectively. In addition, based on call detail records and a combined decision-tree and logistic-regression algorithm, the invention continually identifies new fraud numbers from calls across the whole network and saves the recording files corresponding to the identified fraud numbers into the fraud sample library as fraud samples, so that the library grows ever richer and the recognition accuracy for fraud voice keeps improving.

Brief description of the drawings

Fig. 1 is a flow chart of the method for identifying playback-type fraudulent calls of the present invention.

Fig. 2 is a detailed flow chart of the two-dimensional (time and energy) feature value analysis performed on voice in step 3.

Fig. 3 is a detailed flow chart of how the present invention identifies new suspected fraud numbers from calls across the whole network, based on call detail records and the combined decision-tree and logistic-regression algorithm.

Fig. 4 is a detailed flow chart of selecting, in step A2 of Fig. 3, the interval range in which feature index T_j indicates a fraudulent call.

Fig. 5 is a structural diagram of the fraud identification platform of the system for identifying playback-type fraudulent calls of the present invention.

Fig. 6 is a structural diagram of the fraud number analysis device of the fraud identification platform.

Fig. 7 is a structural diagram of the decision algorithm computing unit.

Detailed description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Because the same voice is distributed consistently along the time axis during its valid-speech stages, the present invention can judge in real time, from the distribution of voice on the time axis, whether an incoming call in progress is a fraudulent call. As shown in Fig. 1, the method for identifying playback-type fraudulent calls of the present invention comprises:

Step 1: when a call request from the calling user is received, bridge the voice channel between the calling and called parties; after the voice channel is established, record the caller's voice one-way, and generate a new recording file after S seconds of recording. Because the voice call between the calling and called parties is bridged, all speech data between them is carried through the fraud identification platform. For a playback-type fraudulent call, the voice on the calling side is relatively fixed, whereas the voice on the called side would interfere with the caller's voice, so the present invention records only the caller's voice. The value of S can be set according to actual needs;

Step 2: build a time feature value set for the newly generated recording file: detect the voice start point of the recording file; then, starting from the voice start point, sequentially extract multiple groups of several frames of voice information from the recording file, and save, in order, the number of frames between the valid-speech start point and end point within each group into the time feature value set of the recording file. The voice start point and end point can be detected with the dual-threshold decision method based on short-time energy and zero-crossing rate, so as to reject interference from silent segments of the call;

Step 3: read the time feature value set of each fraud sample one by one from the fraud sample library, and compare, one by one, the time feature values at the same ordinal positions in the respective sets of the newly generated recording file and the fraud sample, thereby calculating the number TS of identical time feature values, and use TS to judge whether the recording file and the fraud sample are the same voice. For example, when the number of identical time feature values between the newly generated recording file and the fraud sample exceeds a certain value, the recording file and the fraud sample are the same voice; in other words, the incoming call corresponding to the newly generated recording file is a fraudulent call.
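The position-by-position comparison in step 3 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names and the `min_matches` parameter are hypothetical, and the time feature value sets are assumed to be plain integer sequences stored in their original order.

```python
from typing import Sequence


def count_matching_positions(recording: Sequence[int], sample: Sequence[int]) -> int:
    """Return TS: the number of positions holding the same time feature value."""
    # zip stops at the shorter sequence, so only shared positions are compared.
    return sum(1 for r, s in zip(recording, sample) if r == s)


def is_same_voice(recording: Sequence[int], sample: Sequence[int], min_matches: int) -> bool:
    """Judge the two recordings to be the same voice when TS exceeds min_matches.

    min_matches is a hypothetical tuning parameter; the text only says
    "a certain value".
    """
    return count_matching_positions(recording, sample) > min_matches
```

For instance, `count_matching_positions([5, 3, 0, 4], [5, 2, 0, 4])` counts the three positions where the values agree.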

In the present invention, building the respective time feature value set for the newly generated recording file, or for each fraud sample in the fraud sample library, may further comprise:

Starting from the voice start point of the recording file or fraud sample, and taking n seconds as one frame, sequentially extract N groups of M frames of voice information from the recording file or fraud sample; using voice endpoint detection, calculate the number of frames between the valid-speech start point and end point within each group of M frames, and record that frame count as the time feature value of the group; then save the N calculated time feature values into the time feature value set of the recording file or fraud sample, in the order in which the groups occur. The values of n, N and M can be set according to actual needs, for example n=10ms, N=100, M=5. Repeated tests show that the invention performs well when the minimum voice length is set to 10s or more, i.e. N >= 100, M=5.
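The grouping described above can be sketched as follows. The sketch assumes that an endpoint detector has already labelled each frame as voiced or not, and it interprets "the number of frames between the valid-speech start point and end point" as the inclusive span from the first to the last voiced frame in a group; both the interpretation and the function name are assumptions, not details given in the text.

```python
from typing import List, Sequence


def build_time_feature_set(voiced: Sequence[bool], M: int, N: int) -> List[int]:
    """Build N time feature values from per-frame voiced flags.

    voiced[i] is True when frame i contains valid speech (assumed to come
    from a separate endpoint detector). Groups are taken from the first
    voiced frame onward, M frames per group.
    """
    flags = list(voiced)
    if True not in flags:
        return [0] * N                    # no speech at all
    start = flags.index(True)             # voice start point of the recording
    features = []
    for g in range(N):
        group = flags[start + g * M : start + (g + 1) * M]
        hits = [i for i, v in enumerate(group) if v]
        # Inclusive span between first and last voiced frame; 0 for a silent group.
        features.append(hits[-1] - hits[0] + 1 if hits else 0)
    return features
```

Groups that run past the end of the recording are treated as silent, which keeps the set at a fixed length N so that sets can be compared position by position.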

Because the same voice is also distributed consistently in waveform amplitude (energy), and the probability that different voices are identical in both the time and energy dimensions is extremely small, the present invention can, on the basis of the time axis, further perform a two-dimensional feature value analysis of time and energy on the voice, so as to identify fraud voice effectively. Between step 2 and step 3, the method may further comprise:

Build an energy feature value set for the newly generated recording file. Building the respective energy feature value set for the newly generated recording file, or for each fraud sample in the fraud sample library, may further comprise:

Starting from the voice start point of the recording file or fraud sample, and taking n seconds as one frame, sequentially extract M*N frames of voice information from the recording file or fraud sample, calculate the short-time energy value of each frame, and record it as the energy feature value of that frame; then save the M*N energy feature values into the energy feature value set of the recording file or fraud sample, in the order in which the frames occur.

Step 3 may further comprise:

Compare, one by one, the energy feature values at the same ordinal positions in the respective energy feature value sets of the newly generated recording file and the fraud sample, thereby calculating the number ES of identical energy feature values, and use ES to judge whether the recording file and the fraud sample are the same voice.
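Building the energy feature value set can be sketched as below. The sketch uses the common definition of short-time energy as the sum of squared sample amplitudes within a frame, which the text does not spell out; the function names, the `frame_len` parameter (samples per frame) and the assumption that `samples` already begins at the voice start point are all illustrative.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[float], frame_len: int) -> List[Sequence[float]]:
    """Cut the signal into consecutive, non-overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def short_time_energy(frame: Sequence[float]) -> float:
    """Short-time energy, taken here as the sum of squared amplitudes (assumed definition)."""
    return float(sum(x * x for x in frame))


def build_energy_feature_set(samples: Sequence[float], frame_len: int,
                             M: int, N: int) -> List[float]:
    """Return up to M*N per-frame energy feature values, in frame order."""
    frames = split_frames(samples, frame_len)[:M * N]
    return [short_time_energy(f) for f in frames]
```

A trailing partial frame is dropped, so every energy value covers the same number of samples.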

As shown in Fig. 2, performing the two-dimensional feature value analysis of time and energy on the voice in step 3 may further comprise:

Step 31: read the time feature value set of each fraud sample one by one from the fraud sample library;

Step 32: following the order of the time feature values in each set, compare, one by one, the time feature values at the same ordinal positions in the time feature value sets of the newly generated recording file and the fraud sample, thereby calculating the number TS of identical time feature values;

Step 33: read the energy feature value set of the fraud sample, and extract the first K energy feature values from the energy feature value sets of the newly generated recording file and of the fraud sample respectively; the value of K can be set according to actual needs, for example K=5;

Step 34: calculate the energy amplification factor B between the fraud sample and the recording file, where YE_i is the i-th energy feature value in the energy feature value set of the fraud sample, and GE_i is the i-th energy feature value in the energy feature value set of the newly generated recording file;

Step 35: according to the energy amplification factor B, adjust each energy feature value in the energy feature value set of the newly generated recording file: GE_i' = B × GE_i, where i is a natural number between 1 and M*N and GE_i' is the adjusted value of GE_i;

Step 36: following the order of the energy feature values in each set, compare, one by one, the energy feature values at the same ordinal positions in the energy feature value sets of the newly generated recording file and the fraud sample, thereby calculating the number ES of identical energy feature values;

Step 37: calculate the fraud voice confidence of the recording file and the fraud sample, where F is the weight coefficient of the confidence, and judge whether the fraud voice confidence is greater than the threshold CC. If so, the voices of the newly generated recording file and the fraud sample are the same, i.e. the incoming call corresponding to the newly generated recording file can be judged a fraudulent call; tear down the voice channel between the calling and called parties, and end this flow. If not, the voices of the newly generated recording file and the fraud sample differ; proceed to the next step. The values of F and the threshold CC can be set according to the actual situation, for example F=0.5, CC=90%;

Step 38: judge whether all fraud samples in the fraud sample library have been read. If so, the voice of the newly generated recording file differs from that of every fraud sample in the library, and this flow ends; if not, read the time feature value set of the next fraud sample from the library and return to step 32.
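Steps 32 to 37 can be sketched as below, under clearly stated assumptions, because the formulas for B and for the confidence are not reproduced in this text. The sketch assumes that B is the ratio of the sums of the first K energy values (sample over recording), that the confidence is a weighted average of the match ratios, F × TS/N + (1 - F) × ES/(M*N), and that energy values count as "identical" when they agree within a relative tolerance. All three choices, and every name in the code, are hypothetical.

```python
from typing import Sequence


def energy_factor(sample_energy: Sequence[float], rec_energy: Sequence[float], K: int) -> float:
    """Assumed form of B: sum of first K sample energies over sum of first K recording energies."""
    denom = sum(rec_energy[:K])
    if denom == 0:
        raise ValueError("recording has no energy in its first K frames")
    return sum(sample_energy[:K]) / denom


def count_equal(a: Sequence[float], b: Sequence[float], rel_tol: float = 0.0) -> int:
    """Count positions where the values agree within rel_tol (0.0 means exact equality)."""
    return sum(1 for x, y in zip(a, b)
               if abs(x - y) <= rel_tol * max(abs(x), abs(y)))


def fraud_confidence(rec_time: Sequence[int], smp_time: Sequence[int],
                     rec_energy: Sequence[float], smp_energy: Sequence[float],
                     K: int = 5, F: float = 0.5, rel_tol: float = 0.05) -> float:
    """Assumed confidence: F * TS/len(time set) + (1 - F) * ES/len(energy set).

    Both sample sets must be non-empty.
    """
    ts = count_equal(rec_time, smp_time)                  # step 32
    B = energy_factor(smp_energy, rec_energy, K)          # steps 33-34
    adjusted = [B * g for g in rec_energy]                # step 35: GE_i' = B * GE_i
    es = count_equal(adjusted, smp_energy, rel_tol)       # step 36
    return F * ts / len(smp_time) + (1 - F) * es / len(smp_energy)  # step 37
```

Under these assumptions, a caller would compare the returned value with the threshold CC (0.9 in the example from the text) and move on to the next fraud sample when the threshold is not exceeded.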

The present invention can also, based on a combined decision-tree and logistic-regression algorithm, periodically identify new fraud numbers from the call detail records of the whole network, and save the recording files corresponding to the new fraud numbers into the fraud sample library as fraud samples. Over time, the fraud sample library grows ever richer and the recognition accuracy for fraud voice keeps rising. As shown in Fig. 3, identifying new suspected fraud numbers from calls across the whole network, based on call detail records and the combined decision-tree and logistic-regression algorithm, may further comprise:

Step A1: extract all call detail records and compute several feature index values for each calling number in the records. The feature indices may include, but are not limited to: total number of calls, total call duration, number of unanswered calls, total ring duration, total number of active releases, total number of passive releases, and so on. Then judge whether the calling number is a confirmed fraud number: if so, set its fraud flag value to 1; if not, set it to 0;

Step A2: according to the feature index values and fraud flag values of all calling numbers in the call detail records, use a decision tree algorithm, with information gain as the attribute selection criterion of the decision tree, to select for each feature index the interval range that indicates a fraudulent call;

Step A3: extract from the call detail records a calling number whose fraud flag value is 0;

Step A4: judge whether each feature index value of the calling number falls within the fraud interval range of the corresponding feature index. If so, proceed to the next step; if not, return to step A3, until all calling numbers whose fraud flag value is 0 have been extracted, whereupon this flow ends;

Step A5: using a logistic regression algorithm, calculate the fraud suspicion index P of the calling number, where X is the fraud feature value over all feature indices of the calling number, V is the total number of feature indices of the calling number, α_j is the weight coefficient of feature index T_j, x_j is the value of feature index T_j for the calling number in the call detail records, and β is a maximum likelihood estimate. Then judge whether the fraud suspicion index of the calling number is greater than the threshold of the fraud suspicion index. If so, proceed to the next step; if not, return to step A3, until all calling numbers whose fraud flag value is 0 have been extracted, whereupon this flow ends;

The threshold of the fraud suspicion index is a real number in the interval [0,1), and its value can be set according to the actual situation. The larger the fraud suspicion index, the more likely the calling number is a fraud number. For example, if the threshold is set to 0.9, a calling number whose fraud suspicion index is greater than or equal to 0.9 is a suspected fraud number. As for the values of α_j and β, a portion of the call detail records can be extracted as samples and initial values set for α_j and β; the fraud suspicion index calculated for each calling number in the samples is then compared with its actual fraud flag value, and α_j and β are adjusted until they meet the practical needs of the system. For example, the weight coefficient of the feature index "number of calls" is set to -0.6626, that of "call duration" to 0.004633, that of "ring duration" to -0.001043, that of "called-party release count" to 0.351, and β is set to -6.189;

Step A6: write the calling number into the suspected fraud number list, and judge whether all calling numbers whose fraud flag value is 0 have been extracted from the call detail records. If so, this flow ends; if not, return to step A3.
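The scoring in step A5 can be sketched as below. The formulas for P and X are not reproduced in this text, so the sketch assumes the standard logistic regression form, X = β + Σ α_j × x_j and P = 1 / (1 + e^(-X)). The weights and β are the example values quoted above; the dictionary keys are hypothetical names, and the units expected for each feature are not stated in the text.

```python
import math
from typing import Mapping

# Example coefficients quoted in the text; key names are illustrative only.
EXAMPLE_WEIGHTS = {
    "call_count": -0.6626,
    "call_duration": 0.004633,
    "ring_duration": -0.001043,
    "called_release_count": 0.351,
}
EXAMPLE_BETA = -6.189


def logistic(x: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def suspicion_index(features: Mapping[str, float],
                    weights: Mapping[str, float], beta: float) -> float:
    """Assumed form of P: logistic(beta + sum of alpha_j * x_j)."""
    x = beta + sum(weights[name] * features[name] for name in weights)
    return logistic(x)


def is_suspected(features: Mapping[str, float], threshold: float = 0.9) -> bool:
    """Flag the number when P reaches the threshold (0.9 in the text's example)."""
    return suspicion_index(features, EXAMPLE_WEIGHTS, EXAMPLE_BETA) >= threshold
```

In this reading β acts as the intercept, so a calling number with all-zero feature values scores well below the 0.9 threshold.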

As shown in Fig. 4, in step A2, selecting for feature index T_j the interval range that indicates a fraudulent call may further comprise:

Step A21: build a feature index universal set A for feature index T_j, where A contains all calling numbers in the call detail records; denote the numbers of calling numbers in A whose fraud flag value is 1 and 0 as p and h respectively; and set the leaf-node layer counter q of the decision tree to 0;

Step A22: according to the distribution of the T_j values of all calling numbers in A, divide the T_j values into several interval ranges and build several subsets of A, the subsets corresponding one-to-one to the interval ranges of T_j; assign each calling number in A to the subset corresponding to the interval range in which its T_j value falls; then calculate the information gain value of each subset of A: gain(a_z) = I - E(a_z), where a_z is a subset of A, I is the information entropy of all subsets of feature index T_j, E(a_z) is the expected information of subset a_z, p_z is the number of calling numbers in a_z whose fraud flag value is 1, h_z is the number of calling numbers in a_z whose fraud flag value is 0, and I(a_z) is the information entropy of subset a_z; finally, pick the maximum among the information gain values of all subsets;

Step A23: update q: q = q + 1, and judge whether q is greater than the total number of layers Q of the decision tree. If so, the interval range of T_j corresponding to the subset with the maximum information gain value is the interval range in which T_j indicates a fraudulent call, and this flow ends; if not, update A to the subset corresponding to the maximum information gain value and return to step A22. The value of Q can be chosen as circumstances require, for example Q=5.
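One round of step A22 can be sketched as below. The definitions of I, E(a_z) and I(a_z) are not reproduced in this text, so the sketch assumes the usual ID3-style forms: binary entropy over the fraud and non-fraud counts, and E(a_z) equal to the subset's share of A multiplied by its entropy. These forms and all function names are assumptions.

```python
import math
from typing import Dict, Hashable, Tuple


def binary_entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a set with pos fraud numbers and neg non-fraud numbers."""
    total = pos + neg
    if total == 0:
        return 0.0
    ent = 0.0
    for count in (pos, neg):
        if count:                       # skip empty classes: 0 * log(0) is taken as 0
            q = count / total
            ent -= q * math.log2(q)
    return ent


def subset_gain(p: int, h: int, p_z: int, h_z: int) -> float:
    """gain(a_z) = I - E(a_z), with E(a_z) assumed to be the size-weighted subset entropy."""
    I = binary_entropy(p, h)
    E = (p_z + h_z) / (p + h) * binary_entropy(p_z, h_z)
    return I - E


def best_interval(p: int, h: int, subsets: Dict[Hashable, Tuple[int, int]]) -> Hashable:
    """Return the interval whose subset (p_z, h_z) has the largest information gain."""
    return max(subsets, key=lambda k: subset_gain(p, h, *subsets[k]))
```

Under this reading, a subset scores highest when it is pure or small, so the sketch says nothing about whether the winning interval is dominated by fraud numbers; a real implementation would need to check that separately.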

For example, through the above steps, the fraud interval ranges of the three feature indices "total number of calls", "total call duration" and "ring duration" can be obtained: the fraud interval range of "total number of calls" is 22 or more; that of "total call duration" is [5,10] (unit: minutes); and that of "ring duration" is 30 seconds or more. Thus, when a calling number's total number of calls is greater than or equal to 22, its total call duration lies within [5,10] and its ring duration is greater than or equal to 30 seconds, the calling number is likely to be a suspected fraud number, and the logistic regression algorithm is then applied for the next stage of judgment.
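The interval check of step A4, using the example ranges just given, can be sketched as follows. The feature key names and the `FRAUD_RANGES` table are hypothetical; only the three numeric ranges come from the text.

```python
import math
from typing import Dict, Mapping, Tuple

# Example fraud interval ranges from the text, as inclusive (low, high) bounds.
FRAUD_RANGES: Dict[str, Tuple[float, float]] = {
    "total_calls": (22, math.inf),        # 22 calls or more
    "total_call_minutes": (5, 10),        # within [5, 10] minutes
    "ring_seconds": (30, math.inf),       # 30 seconds or more
}


def in_fraud_ranges(features: Mapping[str, float],
                    ranges: Mapping[str, Tuple[float, float]] = FRAUD_RANGES) -> bool:
    """Return True only if every feature value lies inside its fraud interval range."""
    return all(low <= features[name] <= high
               for name, (low, high) in ranges.items())
```

Only calling numbers that pass this check would go on to the logistic regression scoring of step A5.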

Thus, when the calling number of a call in progress is a suspected fraud number in the suspected fraud number list, manual review can confirm whether it is a fraudulent call; if so, the newly generated recording file is added to the fraud sample library as a fraud sample in real time, continually enriching the library and effectively improving the recognition accuracy for fraud voice. Accordingly, in step 38 of step 3, when all fraud samples in the fraud sample library have been read, the method may further comprise:

Judge whether the calling number is in the suspected fraud number list. If so, submit the newly generated recording file for manual review and, once it is confirmed as a fraudulent call, save the recording file together with its time feature value set and energy feature value set into the fraud sample library as a fraud sample; if not, this flow ends.

The system for identifying playback-type fraudulent calls of the present invention comprises a fraud identification platform. As shown in Fig. 5, the fraud identification platform further comprises:

A time feature value set construction device, for building the respective time feature value set for the newly generated recording file or for each fraud sample in the fraud sample library: detect the voice start point of the recording file or fraud sample; then, starting from the voice start point, sequentially extract multiple groups of several frames of voice information from the recording file or fraud sample, and save, in order, the number of frames between the valid-speech start point and end point within each group into the time feature value set of the recording file or fraud sample. The voice start point and end point can be detected with the dual-threshold decision method based on short-time energy and zero-crossing rate, so as to reject interference from silent segments of the call;

A fraud voice recognition device, for reading the time feature value set of each fraud sample one by one from the fraud sample library, and comparing, one by one, the time feature values at the same ordinal positions in the respective sets of the newly generated recording file and the fraud sample, thereby calculating the number TS of identical time feature values, and using TS to judge whether the recording file and the fraud sample are the same voice. For example, when the number of identical time feature values between the newly generated recording file and the fraud sample exceeds a certain threshold, the recording file and the fraud sample are the same voice; in other words, the incoming call corresponding to the newly generated recording file is a fraudulent call.

The time feature value set construction device may further comprise:

A time feature value computing unit, for sequentially extracting N groups of M frames of voice information from the recording file or fraud sample, starting from its voice start point and taking n seconds as one frame; calculating, using voice endpoint detection, the number of frames between the valid-speech start point and end point within each group of M frames, and recording that frame count as the time feature value of the group; and then saving the N calculated time feature values into the time feature value set of the recording file or fraud sample, in the order in which the groups occur.

The present invention can also further perform the two-dimensional feature value analysis of time and energy on the voice, in which case the fraud identification platform may further comprise:

An energy feature value set construction device, for building the respective energy feature value set for the newly generated recording file or for each fraud sample in the fraud sample library: starting from the voice start point of the recording file or fraud sample, and taking n seconds as one frame, sequentially extract M*N frames of voice information from the recording file or fraud sample, calculate the short-time energy value of each frame, and record it as the energy feature value of that frame; then save the M*N energy feature values into the energy feature value set of the recording file or fraud sample, in the order in which the frames occur,

The fraud speech recognition device may further include:

A time feature identical-count computing unit, which compares, position by position, the time feature values in the time feature value sets of the newly generated recording file and of the fraud sample, thereby counting TS, the number of identical time feature values in the two sets, and uses TS to judge whether the recording file and the fraud sample are the same speech;
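The position-by-position comparison can be sketched in a few lines. The function name and the example threshold are illustrative assumptions; the source only says that the count is compared against "a certain threshold".

```python
from typing import Sequence


def count_identical(recording: Sequence[int], sample: Sequence[int]) -> int:
    """Count positions at which the two feature value sets hold the same value."""
    return sum(1 for r, s in zip(recording, sample) if r == s)


ts = count_identical([3, 5, 2, 7], [3, 4, 2, 7])
same_speech = ts > 2  # hypothetical threshold, chosen only for this example
```

The same helper applies unchanged to energy feature value sets when counting ES.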

An energy amplification factor computing unit, which extracts the first K energy feature values from the energy feature value sets of the newly generated recording file and of the fraud sample, and then calculates the energy amplification factor B of the fraud sample relative to the recording file, where YE_i is the i-th energy feature value in the energy feature value set of the fraud sample and GE_i is the i-th energy feature value in the energy feature value set of the newly generated recording file. According to the energy amplification factor B, each energy feature value in the energy feature value set of the newly generated recording file is then adjusted: GE_i' = B × GE_i, where i is a natural number from 1 to M*N and GE_i' is the adjusted value of GE_i;
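The adjustment GE_i' = B × GE_i can be illustrated as below. The formula for B itself is not reproduced in this text, so the ratio of the sums of the first K energy feature values used here is purely an illustrative assumption, as are the function names.

```python
from typing import List, Sequence


def energy_factor(ye: Sequence[float], ge: Sequence[float], k: int) -> float:
    """Assumed factor B: sum of the sample's first k energies over the recording's.

    This ratio is NOT taken from the source; it is a stand-in for the
    formula that the text omits.
    """
    denom = sum(ge[:k])
    if denom == 0:
        raise ValueError("recording has no energy in its first k frames")
    return sum(ye[:k]) / denom


def adjust_energies(ge: Sequence[float], b: float) -> List[float]:
    """Apply GE_i' = B * GE_i to every energy feature value of the recording."""
    return [b * g for g in ge]


ye = [2.0, 4.0, 6.0]   # fraud sample energies (YE_i)
ge = [1.0, 2.0, 3.0]   # recording energies (GE_i)
b = energy_factor(ye, ge, k=2)
ge_adjusted = adjust_energies(ge, b)
```

Scaling the recording before comparison compensates for a playback that is simply louder or quieter than the stored sample.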

An energy feature identical-count computing unit, which compares, position by position, the energy feature values in the energy feature value sets of the newly generated recording file and of the fraud sample, thereby counting ES, the number of identical energy feature values in the two sets, and uses ES to judge whether the recording file and the fraud sample are the same speech;

A fraud confidence computing unit, which calculates the fraud speech confidence of the newly generated recording file with respect to the fraud sample, where F is the weight coefficient of the confidence, and judges whether this fraud speech confidence is greater than the threshold CC. If so, the speech of the newly generated recording file and of the fraud sample is identical, that is, the incoming call corresponding to the newly generated recording file can be judged a fraudulent call, and the voice channel between the calling and called parties is torn down; if not, the speech of the newly generated recording file and of the fraud sample differs;

A fraud sample update unit, which, when the fraud speech confidence of the newly generated recording file with respect to every fraud sample in the fraud sample library is not greater than the threshold CC, judges whether the calling number is in the suspected-fraud number list. If so, the newly generated recording file is reviewed manually and, once it is confirmed as a fraudulent call, the newly generated recording file together with its time feature value set and energy feature value set is saved to the fraud sample library as a fraud sample.

The present invention may also use a combined decision tree and logistic regression algorithm to identify new fraud numbers periodically from the call detail records of the whole network, and save the recording files corresponding to the new fraud numbers to the fraud sample library as fraud samples. Over time the fraud sample library becomes richer and the accuracy of fraud speech recognition rises. The fraud identification platform may therefore also include a fraud number analysis device. As shown in Fig. 6, the fraud number analysis device further includes:

A feature index setting unit, which computes several feature index values for each calling number in the call detail records and judges whether the calling number is in the interception list; if so, the fraud flag value of the calling number is set to 1; if not, it is set to 0;

A decision algorithm computing unit, which, from the feature index values and fraud flag values of all calling numbers in the call detail records, uses a decision tree algorithm with information gain as the attribute selection criterion to select, for each feature index, the interval range attributable to fraudulent calls. It then extracts from the call detail records, one by one, each calling number whose fraud flag value is 0 and judges whether every feature index value of that calling number lies within the interval range that the corresponding feature index attributes to fraudulent calls; if so, the calling number is a suspected fraud number;

A logistic regression computing unit, which uses a logistic regression algorithm to extract from the call detail records, one by one, each calling number whose fraud flag value is 0 and to calculate the fraud suspicion index P of that calling number, where X is the fraud feature value over all feature indexes of the calling number, V is the total number of feature indexes of the calling number, α_j is the weight coefficient of feature index T_j, x_j is the value of feature index T_j for the calling number in the call detail records, and β is the maximum likelihood estimate. It then judges whether the fraud suspicion index of the calling number is greater than the threshold of the fraud suspicion index; if so, the calling number is a suspected fraud number. The threshold of the fraud suspicion index is a real number in the interval [0, 1);
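The formulas for P and X are not reproduced in this text. The sketch below therefore assumes the standard logistic regression form, P = 1 / (1 + e^(-X)) with X = β + Σ α_j x_j over the V feature indexes; this form, the function names, and the example weights and threshold are assumptions rather than the patent's stated method.

```python
import math
from typing import Sequence


def fraud_suspicion_index(alphas: Sequence[float], xs: Sequence[float], beta: float) -> float:
    """Assumed standard logistic form: P = 1 / (1 + exp(-X)), X = beta + sum(alpha_j * x_j)."""
    if len(alphas) != len(xs):
        raise ValueError("one weight coefficient is needed per feature index")
    x = beta + sum(a * v for a, v in zip(alphas, xs))
    return 1.0 / (1.0 + math.exp(-x))


def is_suspected(p: float, threshold: float) -> bool:
    """A number is suspected when P exceeds a threshold in [0, 1)."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")
    return p > threshold


# Hypothetical weights and feature index values for one calling number.
p = fraud_suspicion_index(alphas=[0.5, -0.25], xs=[2.0, 4.0], beta=0.0)
suspected = is_suspected(p, threshold=0.4)
```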

A fraud number storage unit, which writes the suspected fraud numbers determined by the decision algorithm computing unit or the logistic regression computing unit into the suspected-fraud number list.

As shown in Fig. 7, the decision algorithm computing unit may further include:

A decision tree growth component, which builds a feature index universal set A for each feature index, where A contains all calling numbers in the call detail records, and records as p and h the numbers of calling numbers in A whose fraud flag value is 1 and 0 respectively. It also sets the decision tree leaf node layer index q to 0 and then sends A, p, h and the other parameters to the decision tree leaf node component. After receiving the subset returned by the decision tree leaf node component, it updates q: q = q + 1, and judges whether q is greater than the total number of decision tree layers Q. If so, the interval range of feature index values corresponding to the subset sent by the decision tree leaf node component is the interval range that the corresponding feature index attributes to fraudulent calls; if not, A is updated to the subset sent by the decision tree leaf node component, and A, p, h and the other parameters are again sent to the decision tree leaf node component;

A decision tree leaf node component, which, according to the distribution of the feature index values of all calling numbers contained in A, divides the feature index values into several interval ranges and builds several subsets of A, the subsets of A being in one-to-one correspondence with the interval ranges of the feature index values. It then assigns each calling number contained in A to the subset corresponding to the interval range to which its feature index value belongs, and calculates the information gain value of each subset of A: gain(a_z) = I - E(a_z), where a_z is a subset of A, I is the information entropy of all subsets of feature index T_j, E(a_z) is the information expectation of subset a_z, p_z is the number of calling numbers in subset a_z whose fraud flag value is 1, h_z is the number of calling numbers in subset a_z whose fraud flag value is 0, and I(a_z) is the information entropy of subset a_z. Finally, the subset corresponding to the maximum information gain value is returned to the decision tree growth component.
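The per-subset gain gain(a_z) = I - E(a_z) can be sketched as follows. The formulas for I, E(a_z) and I(a_z) are not reproduced in this text, so the sketch assumes the usual ID3-style definitions: I is the binary entropy of the counts (p, h) in A, I(a_z) is the binary entropy of (p_z, h_z), and E(a_z) = ((p_z + h_z) / (p + h)) × I(a_z). These definitions and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def binary_entropy(p: int, h: int) -> float:
    """Entropy in bits of a set with p fraud-flagged and h unflagged numbers."""
    total = p + h
    if total == 0:
        return 0.0
    result = 0.0
    for count in (p, h):
        if count > 0:
            frac = count / total
            result -= frac * math.log2(frac)
    return result


def subset_gains(subsets: Sequence[Tuple[int, int]]) -> List[float]:
    """Return gain(a_z) = I - E(a_z) for each subset given as (p_z, h_z).

    Assumed: E(a_z) is the subset's entropy weighted by its share of A.
    """
    p = sum(pz for pz, _ in subsets)
    h = sum(hz for _, hz in subsets)
    total_entropy = binary_entropy(p, h)
    gains = []
    for pz, hz in subsets:
        expectation = (pz + hz) / (p + h) * binary_entropy(pz, hz)
        gains.append(total_entropy - expectation)
    return gains


# Two interval ranges of one feature index: (fraud count, non-fraud count).
gains = subset_gains([(3, 0), (1, 4)])
best = max(range(len(gains)), key=gains.__getitem__)
```

The index `best` identifies the subset that would be returned to the decision tree growth component, which then either stops (q > Q) or subdivides that subset again.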

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for identifying playback-type fraudulent calls, characterized in that it comprises:

Step 1: after the voice channel between the calling and called parties is established, record the caller's speech in one direction only, and generate a new recording file after S seconds of recording;

Step 2: build a time feature value set for the newly generated recording file: detect the speech start point of the newly generated recording file, then, starting from the speech start point, sequentially extract multiple segments of several frames of speech each from the recording file, and save, in order, the number of frames between the start point and the end point of effective speech in each segment to the time feature value set of the recording file;

Step 3: read the time feature value set of each fraud sample one by one from the fraud sample library, and compare, position by position, the time feature values in the time feature value sets of the newly generated recording file and of the fraud sample, thereby counting TS, the number of identical time feature values in the two sets, and use TS to judge whether the recording file and the fraud sample are the same speech.

2. The method according to claim 1, characterized in that building a respective time feature value set for the newly generated recording file or for each fraud sample in the fraud sample library further comprises:

Starting from the speech start point of the recording file or fraud sample and taking n seconds as one frame, sequentially extracting N segments of M frames of speech each from the recording file or fraud sample; using speech endpoint detection, calculating for each M-frame segment the number of frames between the start point and the end point of effective speech, and recording that number as the time feature value of the segment; then saving the N calculated time feature values, in their order of occurrence in the recording file or fraud sample, to the time feature value set of the recording file or fraud sample.

3. The method according to claim 2, characterized in that, between step 2 and step 3, it further comprises: building a respective energy feature value set for the newly generated recording file and for each fraud sample in the fraud sample library,

wherein building a respective energy feature value set for the newly generated recording file or for each fraud sample in the fraud sample library further comprises:

Starting from the speech start point of the recording file or fraud sample and taking n seconds as one frame, sequentially extracting M*N frames of speech from the recording file or fraud sample, calculating the short-time energy value of each frame, and recording that value as the energy feature value of the frame; then saving the M*N energy feature values, in their order of occurrence in the recording file or fraud sample, to the energy feature value set of the recording file or fraud sample,

and step 3 further comprises:

Comparing, position by position, the energy feature values in the energy feature value sets of the newly generated recording file and of the fraud sample, thereby counting ES, the number of identical energy feature values in the two sets, and using ES to judge whether the recording file and the fraud sample are the same speech.

4. The method according to claim 3, characterized in that step 3 further comprises:

Calculating the fraud speech confidence of the recording file with respect to the fraud sample, where F is the weight coefficient of the confidence, and judging whether this fraud speech confidence is greater than the threshold CC; if so, the speech of the newly generated recording file and of the fraud sample is identical, and the voice channel between the calling and called parties is torn down.

5. The method according to claim 4, characterized in that, before calculating the fraud speech confidence of the recording file with respect to the fraud sample, it further comprises:

Reading the energy feature value set of the fraud sample, and extracting the first K energy feature values from the energy feature value sets of the newly generated recording file and of the fraud sample respectively;

Calculating the energy amplification factor B of the fraud sample relative to the recording file, where YE_i is the i-th energy feature value in the energy feature value set of the fraud sample and GE_i is the i-th energy feature value in the energy feature value set of the newly generated recording file;

According to the energy amplification factor B, adjusting each energy feature value in the energy feature value set of the newly generated recording file: GE_i' = B × GE_i, where i is a natural number from 1 to M*N and GE_i' is the adjusted value of GE_i.

6. The method according to claim 4, characterized in that, when the fraud speech confidence of the recording file with respect to every fraud sample in the fraud sample library is not greater than the threshold CC, it further comprises:

Judging whether the calling number is in the suspected-fraud number list; if so, manually reviewing the newly generated recording file and, once it is confirmed as a fraudulent call, saving the newly generated recording file together with its time feature value set and energy feature value set to the fraud sample library as a fraud sample.

7. The method according to claim 1, characterized in that it further comprises:

Step A1: extract all call detail records and compute several feature index values for each calling number in the call detail records, then judge whether the calling number is a confirmed fraud number; if so, set the fraud flag value of the calling number to 1; if not, set the fraud flag value of the calling number to 0;

Step A2: from the feature index values and fraud flag values of all calling numbers in the call detail records, use a decision tree algorithm with information gain as the attribute selection criterion to select, for each feature index, the interval range attributable to fraudulent calls;

Step A4: judge whether every feature index value of the calling number lies within the interval range that the corresponding feature index attributes to fraudulent calls; if so, continue to the next step; if not, go to step A3, until all calling numbers whose fraud flag value is 0 have been extracted, whereupon this flow ends;

Step A5: use a logistic regression algorithm to calculate the fraud suspicion index P of the calling number, where X is the fraud feature value over all feature indexes of the calling number, V is the total number of feature indexes of the calling number, α_j is the weight coefficient of feature index T_j, x_j is the value of feature index T_j for the calling number in the call detail records, and β is the maximum likelihood estimate; then judge whether the fraud suspicion index of the calling number is greater than the threshold of the fraud suspicion index; if so, continue to the next step; if not, go to step A3, until all calling numbers whose fraud flag value is 0 have been extracted, whereupon this flow ends; the threshold of the fraud suspicion index is a real number in the interval [0, 1),

Step A6: write the calling number into the suspected-fraud number list, and judge whether all calling numbers whose fraud flag value is 0 have been extracted from the call detail records; if so, this flow ends; if not, go to step A3.

8. The method according to claim 7, characterized in that, in step A2, selecting for a feature index T_j the interval range attributable to fraudulent calls further comprises:

Step A21: build a feature index universal set A for feature index T_j, where A contains all calling numbers in the call detail records, and record as p and h the numbers of calling numbers in A whose fraud flag value is 1 and 0 respectively; also set the decision tree leaf node layer index q to 0;

Step A22: according to the distribution of the T_j values of all calling numbers contained in A, divide the T_j values into several interval ranges and build several subsets of A, the subsets of A being in one-to-one correspondence with the interval ranges of the T_j values; then assign each calling number contained in A to the subset corresponding to the interval range to which its T_j value belongs, and calculate the information gain value of each subset of A: gain(a_z) = I - E(a_z), where a_z is a subset of A, I is the information entropy of all subsets of feature index T_j, E(a_z) is the information expectation of subset a_z, p_z is the number of calling numbers in subset a_z whose fraud flag value is 1, h_z is the number of calling numbers in subset a_z whose fraud flag value is 0, and I(a_z) is the information entropy of subset a_z; finally, pick the maximum from the information gain values of all subsets;

Step A23: update q: q = q + 1, and judge whether q is greater than the total number of decision tree layers Q; if so, the interval range of T_j values corresponding to the subset with the maximum information gain value is the interval range that feature index T_j attributes to fraudulent calls, and this flow ends; if not, update A to the subset with the maximum information gain value and continue with step A22.

9. A system for identifying playback-type fraudulent calls, comprising a fraud identification platform, characterized in that the fraud identification platform further comprises:

A voice bridging device, which receives the call request sent by the calling party and then bridges the voice channel between the calling and called parties;

A voice recording device, which, after the voice channel between the calling and called parties is established, records the caller's speech in one direction only and generates a new recording file after S seconds of recording;

A time feature value set construction device, which builds a respective time feature value set for the newly generated recording file or for each fraud sample in the fraud sample library: it detects the speech start point of the newly generated recording file or fraud sample, then, starting from the speech start point, sequentially extracts multiple segments of several frames of speech each from the recording file or fraud sample, and saves, in order, the number of frames between the start point and the end point of effective speech in each segment to the time feature value set of the recording file or fraud sample;

A fraud speech recognition device, which reads the time feature value set of each fraud sample one by one from the fraud sample library and compares, position by position, the time feature values in the time feature value sets of the newly generated recording file and of the fraud sample, thereby counting TS, the number of identical time feature values in the two sets, and uses TS to judge whether the recording file and the fraud sample are the same speech.

10. The system according to claim 9, characterized in that the time feature value set construction device further comprises:

A time feature value computing unit, which, starting from the speech start point of the recording file or fraud sample and taking n seconds as one frame, sequentially extracts N segments of M frames of speech each from the recording file or fraud sample; using speech endpoint detection, it calculates for each M-frame segment the number of frames between the start point and the end point of effective speech, and records that number as the time feature value of the segment; it then saves the N calculated time feature values, in their order of occurrence in the recording file or fraud sample, to the time feature value set of the recording file or fraud sample.

11. The system according to claim 10, characterized in that the fraud identification platform further comprises:

An energy feature value set construction device, which builds a respective energy feature value set for the newly generated recording file or for each fraud sample in the fraud sample library: starting from the speech start point of the recording file or fraud sample and taking n seconds as one frame, it sequentially extracts M*N frames of speech from the recording file or fraud sample, calculates the short-time energy value of each frame, and records that value as the energy feature value of the frame; it then saves the M*N energy feature values, in their order of occurrence in the recording file or fraud sample, to the energy feature value set of the recording file or fraud sample,

and the fraud speech recognition device further comprises:

An energy feature identical-count computing unit, which compares, position by position, the energy feature values in the energy feature value sets of the newly generated recording file and of the fraud sample, thereby counting ES, the number of identical energy feature values in the two sets, and uses ES to judge whether the recording file and the fraud sample are the same speech.

12. The system according to claim 11, characterized in that the fraud speech recognition device further comprises:

A fraud confidence computing unit, which calculates the fraud speech confidence of the newly generated recording file with respect to the fraud sample, where F is the weight coefficient of the confidence, and judges whether this fraud speech confidence is greater than the threshold CC; if so, the speech of the newly generated recording file and of the fraud sample is identical, and the voice channel between the calling and called parties is torn down; if not, the speech of the newly generated recording file and of the fraud sample differs.

13. The system according to claim 12, characterized in that the fraud speech recognition device further comprises:

An energy amplification factor computing unit, which extracts the first K energy feature values from the energy feature value sets of the newly generated recording file and of the fraud sample, and then calculates the energy amplification factor B of the fraud sample relative to the recording file, where YE_i is the i-th energy feature value in the energy feature value set of the fraud sample and GE_i is the i-th energy feature value in the energy feature value set of the newly generated recording file; according to the energy amplification factor B, each energy feature value in the energy feature value set of the newly generated recording file is then adjusted: GE_i' = B × GE_i, where i is a natural number from 1 to M*N and GE_i' is the adjusted value of GE_i; or,

A fraud sample update unit, which, when the fraud speech confidence of the newly generated recording file with respect to every fraud sample in the fraud sample library is not greater than the threshold CC, judges whether the calling number is in the suspected-fraud number list; if so, the newly generated recording file is reviewed manually and, once it is confirmed as a fraudulent call, the newly generated recording file together with its time feature value set and energy feature value set is saved to the fraud sample library as a fraud sample.

14. The system according to claim 9, characterized in that the fraud identification platform further comprises a fraud number analysis device, wherein the fraud number analysis device further comprises:

A feature index setting unit, which computes several feature index values for each calling number in the call detail records and judges whether the calling number is in the interception list; if so, the fraud flag value of the calling number is set to 1; if not, it is set to 0;

A decision algorithm computing unit, which, from the feature index values and fraud flag values of all calling numbers in the call detail records, uses a decision tree algorithm with information gain as the attribute selection criterion to select, for each feature index, the interval range attributable to fraudulent calls; it then extracts from the call detail records, one by one, each calling number whose fraud flag value is 0 and judges whether every feature index value of that calling number lies within the interval range that the corresponding feature index attributes to fraudulent calls; if so, the calling number is a suspected fraud number;

A logistic regression computing unit, which uses a logistic regression algorithm to extract from the call detail records, one by one, each calling number whose fraud flag value is 0 and to calculate the fraud suspicion index P of that calling number, where X is the fraud feature value over all feature indexes of the calling number, V is the total number of feature indexes of the calling number, α_j is the weight coefficient of feature index T_j, x_j is the value of feature index T_j for the calling number in the call detail records, and β is the maximum likelihood estimate; it then judges whether the fraud suspicion index of the calling number is greater than the threshold of the fraud suspicion index; if so, the calling number is a suspected fraud number; the threshold of the fraud suspicion index is a real number in the interval [0, 1);

A fraud number storage unit, which writes the suspected fraud numbers determined by the decision algorithm computing unit or the logistic regression computing unit into the suspected-fraud number list.

15. The system according to claim 14, characterized in that the decision algorithm computing unit further comprises:

A decision tree growth component, which builds a feature index universal set A for each feature index, where A contains all calling numbers in the call detail records, and records as p and h the numbers of calling numbers in A whose fraud flag value is 1 and 0 respectively; it also sets the decision tree leaf node layer index q to 0 and then sends A, p and h to the decision tree leaf node component; after receiving the subset returned by the decision tree leaf node component, it updates q: q = q + 1, and judges whether q is greater than the total number of decision tree layers Q; if so, the interval range of feature index values corresponding to the subset sent by the decision tree leaf node component is the interval range that the corresponding feature index attributes to fraudulent calls; if not, A is updated to the subset sent by the decision tree leaf node component, and A, p and h are again sent to the decision tree leaf node component;

A decision tree leaf node component, which, according to the distribution of the feature index values of all calling numbers contained in A, divides the feature index values into several interval ranges and builds several subsets of A, the subsets of A being in one-to-one correspondence with the interval ranges of the feature index values; it then assigns each calling number contained in A to the subset corresponding to the interval range to which its feature index value belongs, and calculates the information gain value of each subset of A: gain(a_z) = I - E(a_z), where a_z is a subset of A, I is the information entropy of all subsets of feature index T_j, E(a_z) is the information expectation of subset a_z, p_z is the number of calling numbers in subset a_z whose fraud flag value is 1, h_z is the number of calling numbers in subset a_z whose fraud flag value is 0, and I(a_z) is the information entropy of subset a_z; finally, the subset corresponding to the maximum information gain value is returned to the decision tree growth component.