CN112866486A - Multi-source feature-based fraud telephone identification method, system and equipment - Google Patents

Multi-source feature-based fraud telephone identification method, system and equipment

Info

Publication number
CN112866486A
Authority
CN
China
Prior art keywords
fraud
user
data
call
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110138462.XA
Other languages
Chinese (zh)
Other versions
CN112866486B (en)
Inventor
赵玺
褚启伍
任一民
邹建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202110138462.XA
Publication of CN112866486A
Application granted
Publication of CN112866486B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2281 Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a method, a system and equipment for identifying fraud calls based on multi-source features. The method comprises the following steps: selecting users covering normal numbers, promotion numbers and fraud numbers, so as to construct a user classification that better fits reality; constructing multi-source feature indexes for the selected users, including the basic call features, portrait features, location features and internet access features of the user; extracting the structural features of the user's second-degree network with a Struct2Vec graph network model based on graph structure similarity, and identifying fraud pattern structures such as the multi-point one-line structure; converting the user's second-degree call data into call time sequence data and extracting a time-sequence-based feature combination; and, on the basis of the constructed multi-source features, balancing the sample data set with Borderline-SMOTE and finally constructing a classification model for normal, fraud and promotion identification. The model is trained and predicts with a plurality of different ensemble learning combinations and is combined with a black and white list filtering mechanism, so that fraud calls are identified accurately and effectively.

Description

Multi-source feature-based fraud telephone identification method, system and equipment
Technical Field
The invention belongs to the technical field of information, and particularly relates to a method, a system and equipment for recognizing fraud calls based on multi-source characteristics.
Background
With the continuous development of the communication industry, more and more users enjoy the convenience that communication brings to daily life. At the same time, fraudulent calls keep emerging: a large number of groups or individuals harass target groups through fraud, personal attacks and similar means in order to cheat them out of money. Such fraud calls are endless in daily life; they seriously affect the user experience and bring great inconvenience to users' daily lives.
Currently, methods for identifying fraud calls are mostly based on extracting some basic features and identifying with machine learning or deep learning, such as [A fraud application detection method based on deep learning], [A fraud phone analysis method based on multi-dimensional time series] and [A fraud phone number identification method and system]. There are also methods that identify fraud calls through fraud patterns, such as [A fraud phone detection method based on intent understanding technology] and [A fraud phone identification method based on graph embedding]. Summarizing the above patents, however, we find a point that the prior patents do not address:
in real life, there are many other numbers that, like fraud phones, make outgoing calls in high volume and at high frequency, such as promotional phones, express delivery and takeaway phones, taxi and DiDi ride-hailing phones, and company phones. These phones strongly interfere with the identification of fraud phones. Promotional phones in particular have a short life cycle and are difficult to identify in time through a black and white list. The prior art, however, does not address these interference situations in the actual environment.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a fraud telephone identification method based on multi-source features, in which promotional phones are separated out as a class of their own and identification is combined with black and white list filtering, so that the method better fits the actual communication environment and the fraud phone identification effect is greatly improved.
In order to achieve the purpose, the invention adopts the technical scheme that: a fraud telephone identification method based on multi-source characteristics comprises the following steps:
establishing second-degree call data, position data and internet surfing data of users of three categories including normal numbers, promotion numbers and fraud numbers, and extracting basic characteristics of the users based on the second-degree call data, the position data and the internet surfing data;
constructing a Struct2Vec graph network model based on graph structure similarity based on second-degree call data of a user, extracting structural features of a second-degree network of the user, and identifying a multi-point one-line fraud mode structure;
converting first-degree call data of a user into the call time sequence data of the user, and constructing a time-sequence-based call time sequence feature combination according to the call time sequence data;
combining the basic characteristics of the user, the structural characteristics of a user two-degree network and the conversation time sequence characteristics, and constructing a characteristic sample data set by adopting a characteristic filtering and dimension reduction mode;
balancing the feature sample data set by using an oversampling method Borderline-SMOTE;
and constructing a black and white list mechanism, constructing a plurality of different ensemble learning combinations including boosting and bagging based on the balanced feature sample data set, constructing a normal, fraud and promotion identification fusion classification model by adopting a weight value distribution mode based on the ensemble learning combinations, and identifying fraud calls.
The basic characteristics of the user comprise basic call characteristics, portrait information characteristics, position information characteristics and internet surfing information characteristics of the user.
Extracting the structural features of the user's second-degree network with the Struct2Vec graph network model based on graph structure similarity, and identifying the multi-point one-line fraud pattern structure, proceeds as follows:
constructing a graph by using a two-degree call network, and acquiring a vertex pair distance of each layer network based on each vertex of the network, wherein the layer takes the vertex as an origin, the first-degree network is a first layer, the second-degree network is a second layer, and the like;
the vertex pair distance $f_k(u, v)$ is:
$$f_k(u, v) = f_{k-1}(u, v) + g\big(s(R_k(u)),\, s(R_k(v))\big), \quad k \ge 0 \ \text{and}\ |R_k(u)|, |R_k(v)| > 0$$
wherein $R_k(u)$ denotes the set of vertices at distance $k$ from vertex $u$, $R_k(v)$ denotes the set of vertices at distance $k$ from vertex $v$, and $s(R_k(u))$ denotes the ordered degree sequence of the vertex set $R_k(u)$, that is, the degrees of all vertices at distance $k$ from $u$, arranged in order;
$g(D_1, D_2) \ge 0$ is a function measuring the distance between the ordered degree sequences $D_1$ and $D_2$; it is computed with Dynamic Time Warping, and the distance function between elements is defined as:
Figure BDA0002927916690000031
$f_k(u, v)$ denotes the structural distance between vertices $u$ and $v$ over their rings at distance $k$; because $f_{k-1}(u, v)$ is added at every step, the sum is accumulated iteratively and $f_k$ effectively covers the set of nodes at distance less than or equal to $k$; this is the vertex pair distance function;
calculating a distance between two vertexes for each k, and constructing a weighted hierarchical graph through the ordered degree sequence distance between the vertexes for subsequent random walk;
defining the edge weights of two vertices in a certain layer k as
Figure BDA0002927916690000032
The edge weights are all less than or equal to 1, and the edge weight is 1 if and only if the distance is 0;
connecting the same vertexes belonging to different levels through directed edges, namely connecting each vertex with the corresponding same upper-layer vertex and lower-layer vertex to obtain a weighted level graph;
sampling vertex sequences in the weighted hierarchical graph by random walk: each vertex is selected as a starting point and a random walk is performed to obtain a sequence of vertices; the sequence is then regarded as a sentence and learned with word2vec to obtain an embedded feature vector representing each vertex. In this way the structure of each vertex within its second-degree network is mined, and the generated embedded feature vectors are used to identify the multi-point one-line fraud pattern structure.
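The two formulas that appear above only as figure references (BDA0002927916690000031 and BDA0002927916690000032) are not reproduced in this text. As an assumption, the forms below are those of the published struc2vec algorithm, whose definitions of $R_k$, $s(\cdot)$, $g$ and $f_k$ match the ones given here; the patent's own figures may differ in detail.

```latex
% Assumed element distance used inside DTW, for degrees a, b >= 1
d(a, b) = \frac{\max(a, b)}{\min(a, b)} - 1

% Assumed edge weight between vertices u and v in layer k
w_k(u, v) = e^{-f_k(u, v)}
```

Under this assumption $d(a, b) = 0$ when $a = b$, and $w_k(u, v)$ never exceeds 1 and equals 1 exactly when $f_k(u, v) = 0$, which agrees with the statement about edge weights above.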
Converting the user's call data into call time sequence data, and constructing a time-sequence-based call time sequence feature combination, proceeds as follows:
establishing the call time sequence data of a user from the calls and call durations occurring at each moment, the number of calls on each day in a set time period, and the call intervals occurring at each moment, and extracting time sequence features from the time sequence data; the constructed time sequence data is input to the open-source Python package tsfresh, which outputs the set time sequence features.
The characteristic sample data set is balanced by using an improved SMOTE oversampling method Borderline-SMOTE as follows:
dividing the characteristic sample data set into a training set and a test set according to a preset proportion, wherein the test set is unchanged;
applying the Borderline-SMOTE oversampling technique to the training set, and dividing the minority-class fraud samples in the training set into 3 classes, namely Safe, Danger and Noise, wherein a Safe sample is one for which more than half of the surrounding samples are minority-class samples; a Danger sample is one for which more than half of the surrounding samples are majority-class samples, and is regarded as a sample on the boundary; and a Noise sample is one for which all surrounding samples are majority-class samples, and is regarded as noise;
oversampling the minority-class samples of the Danger class, wherein minority-class samples are randomly selected by the K-nearest-neighbor method and used to oversample the minority class.
Constructing a plurality of different ensemble learning combinations, including boosting and bagging, from the multi-source feature data, constructing the classification model for normal, fraud and promotion identification by weight assignment, and identifying fraud numbers in combination with the black and white list mechanism, proceeds as follows:
constructing three binary classification models, namely normal versus fraud, normal versus promotion, and promotion versus fraud, wherein each binary classification model performs combined learning with boosting-based and bagging-based ensemble learning algorithms and outputs the combined learning result as a probability; the advantages of the boosting and bagging ensemble learning algorithms are combined, and different models are selected for integration;
combining the probabilities, and assigning weights by a grid search method to construct a three-class model;
filtering numbers against the white list and matching them against the black list, and identifying the remaining numbers that the lists cannot resolve with the three-class identification model.
Wherein the black and white list includes: identified and marked fraud calls, promotional calls, reliably marked takeaway calls, taxi driver calls, DiDi driver calls, and registered company calls, with promotional and fraud calls marked as one class.
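The fusion and filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not say how the pairwise probabilities are combined, so the weighted voting in `fuse`, the weight grid, the accuracy criterion, and the choice to return "normal" for white-listed numbers are all assumptions, and every function and variable name is hypothetical.

```python
from itertools import product

CLASSES = ("normal", "promotion", "fraud")
PAIRS = (("normal", "fraud"), ("normal", "promotion"), ("promotion", "fraud"))


def fuse(pair_probs, weights):
    """Combine pairwise probabilities into one of three classes.

    pair_probs[(a, b)] is the probability of class b from the a-vs-b model.
    Assumption: each pairwise model casts a weighted vote for both classes.
    """
    scores = dict.fromkeys(CLASSES, 0.0)
    for (a, b), p in pair_probs.items():
        scores[a] += weights[(a, b)] * (1.0 - p)
        scores[b] += weights[(a, b)] * p
    return max(CLASSES, key=lambda c: scores[c])


def grid_search_weights(samples, truths, grid=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Try every weight combination on the grid and keep the most accurate."""
    best_weights, best_acc = None, -1.0
    for combo in product(grid, repeat=len(PAIRS)):
        weights = dict(zip(PAIRS, combo))
        hits = sum(fuse(s, weights) == t for s, t in zip(samples, truths))
        acc = hits / len(truths)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


def identify(number, pair_probs, weights, whitelist, blacklist):
    """Apply white list, then black list, then the fused three-class model."""
    if number in whitelist:          # e.g. registered company numbers
        return "normal"              # assumption: white-listed means not fraud
    if number in blacklist:          # blacklist maps number -> known label
        return blacklist[number]
    return fuse(pair_probs, weights)
```

In practice the pairwise probabilities would come from the boosting and bagging models described above, and the weights would be searched on held-out validation data rather than on the training set.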
A fraud telephone recognition system based on multi-source features comprises a basic feature extraction module, a fraud mode structure recognition module, a time sequence feature combination construction module, a feature sample data set balancing module and a fusion classification recognition module;
the basic feature extraction module is used for constructing second-degree call data, position data and internet surfing data of users in three categories including normal numbers, sales promotion numbers and fraud numbers, and extracting basic features of the users based on the second-degree call data, the position data and the internet surfing data;
the fraud mode structure identification module constructs a Struct2Vec graph network model based on graph structure similarity based on second-degree conversation data of a user, extracts the structural characteristics of a second-degree network of the user and identifies a multi-point one-line fraud mode structure;
the time sequence feature combination construction module is used for converting the first-degree call data of a user into the call time sequence data of that user, and constructing a time-sequence-based call time sequence feature combination according to the call time sequence data;
the characteristic sample data set construction module is used for combining and fusing the basic characteristics of the user, the structural characteristics of the user two-degree network and the call time sequence characteristics, and constructing a characteristic sample data set by adopting a characteristic filtering and dimension reduction mode;
a characteristic sample data set balancing module balances the characteristic sample data set by using an oversampling method Borderline-SMOTE;
the fusion classification identification module is used for constructing a black-and-white list mechanism, constructing a plurality of different integrated learning combinations including boosting and bagging based on the balanced feature sample data set, constructing a normal, fraud and promotion identification fusion classification model based on the integrated learning combinations and by adopting a weight value distribution mode, and identifying fraud calls.
A computer device comprises one or more processors and a memory, wherein the memory is used for storing computer executable programs, the processors read part or all of the computer executable programs from the memory and execute the computer executable programs, and the processors can realize the multi-source-feature-based fraud telephone identification method when executing part or all of the computer executable programs.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, can implement the multi-source feature-based fraud phone identification method of the present invention.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention treats promotional phones as a separate class and constructs a three-class identification model; by adding black and white list filtering, high-frequency numbers with a longer life cycle are filtered out through the black and white list, while the interference of promotional phones is removed through the three-class model, so that the mechanism better fits the actual communication environment and the fraud phone identification effect is greatly improved;
the invention constructs various time sequence data from user call data and extracts different call time sequence features of the user through the tsfresh time sequence package, analyzing the user's call data more deeply from the angle of time sequence feature mining, so that more implicit features such as call patterns are obtained and a good effect is achieved;
the invention provides a Struct2Vec graph network model based on graph structure similarity, which extracts the structural features of a user's second-degree network and identifies the multi-point one-line structure and other hidden fraud pattern structures; these are pattern features of fraud phones that distinguish them from promotional phones. Mining the structural features of fraud phones from the perspective of pattern structure, and thereby separating them from other types of phones, plays an important role in identifying fraud phones;
the invention constructs a plurality of different ensemble learning algorithm combinations, including boosting and bagging, and builds the classification model for normal, fraud and promotion identification by weight assignment, so that combining different ensemble learning algorithms yields a more efficient classification model;
compared with the prior art, this set of combined algorithms for identifying fraud phones based on multi-source features distinguishes the features of fraud phones from different angles, and the constructed algorithm model together with the black and white list mechanism makes the model better suited to the actual, more complex communication environment, so that each evaluation index for fraud phone identification is higher and the effect is better.
Drawings
FIG. 1 is a flow chart of a method of identifying fraudulent calls based on multi-source features of the present invention.
Detailed Description
The invention provides a multi-source feature-based fraud telephone identification method (see fig. 1), which comprises: selecting users covering normal numbers, promotion numbers and fraud numbers, so as to construct a user classification that better fits reality; constructing multi-source feature indexes from the second-degree call signaling, portrait data, location data and internet access data of the selected users over a period of time, the indexes including the basic call features, portrait features, location features and internet access features of the user; extracting the structural features of the user's second-degree network with a graph network model based on graph structure similarity, and identifying fraud pattern structures such as the multi-point one-line structure; converting the user's second-degree call data into call time sequence data and extracting a time-sequence-based feature combination; on the basis of the constructed multi-source features, balancing the sample data set with Borderline-SMOTE and finally constructing a classification model for normal, fraud and promotion identification; and training and predicting with a plurality of different ensemble learning combinations, combined with a black and white list filtering mechanism, so as to identify fraud calls accurately and effectively.
The method specifically comprises the following steps:
Step one, constructing sample data of three categories, namely normal numbers, promotion numbers and fraud numbers, comprising the second-degree call data, location data and internet access data of the user; and extracting, from this data, the basic features of the selected users, comprising the basic call features, portrait features, location features and internet access features of the user.
Step two, constructing the graph network model Struct2Vec based on graph structure similarity from the user's second-degree call data, extracting the structural features of the user's second-degree network, and identifying fraud pattern structures such as the multi-point one-line structure.
Step three, converting the user's call data into call time sequence data, and constructing a time-sequence-based call time sequence feature combination.
Step four, balancing the sample data set with Borderline-SMOTE, an improved SMOTE oversampling method.
Step five, constructing a plurality of different ensemble learning combinations, including boosting and bagging, from the feature data, constructing the classification model for normal, fraud and promotion identification by weight assignment, and combining a black and white list mechanism to identify fraud numbers accurately.
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings.
For the fraud phone and promotional phone numbers supplied as samples by the operator's big data, and for ordinary phone numbers obtained by random sampling, call data, portrait data, location data and internet access data are collected, laying the data foundation for the following steps.
Step one, sample data of three categories, namely normal numbers, promotion numbers and fraud numbers, is constructed, comprising the second-degree call data, location data and internet access data of the user; based on this data, the basic features of the selected users are extracted, comprising the basic call features, portrait information features, location information features and internet access information features of the user.
Wherein the call behavior features include: the total number of calls and the proportions of calling and called calls in different time periods; the call duration and the proportions of calling and called duration in different time periods; the mean, variance and standard deviation of calling and called call duration; the ratio of outgoing to incoming calls on working days within a week; the ratio of outgoing to incoming calls in each time period of a day; the proportion of unanswered calls as calling and called party; the average interval between calls; the average interval between calls to the same number; whether the outgoing number segments are consecutive, and their proportion; the call duration per unit time (hour, day, week and month); the average duration of consecutive calls; the probability of consecutive calls; the proportion of numbers with which the calling or called count is 1; whether the called party dials a marked number such as 110 or 12321 within a certain time after the call; whether the called party is a contact or a stranger of the calling party; the number of distinct calling and called numbers; the proportion, mean, variance and standard deviation of ringing duration; and the same indexes for the opposite-end users of the calls.
The location information features include: the distribution of calling locations by proportion of calls and the entropy of the distribution over all locations; the proportion of distinct locations of the calling party; the daily average number of provinces and cities in which calls are made, and the corresponding entropy; the proportion of called numbers from other regions; the maximum daily distance; the average number and entropy of base stations accessed; the number and proportion of calls made at the workplace and at home; the number and proportion of location records in the daytime and at night; and the like.
The portrait information features include: gender, age, account opening time, accumulated active days, distribution of recent phone charges, whether the extranet is used, registration mode, last downtime, number of handset changes, number of SIM card changes, the operator to which the number belongs, whether a virtual number segment is used, and the like.
The internet access information features include: traffic-related features, such as total traffic usage, mean, variance and usage trend within a certain time period; URL access statistics, such as the trend and number of visits to malicious websites, adult websites, gambling websites and the like; and APP usage statistics, such as the number and proportion of uses of common APPs and usage statistics of special APPs.
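As an illustration of how a few of the basic features described above could be computed, the sketch below derives some call behavior features and a location entropy from raw records. It is only a sketch under assumptions: the record layout (the keys `caller`, `callee` and `duration`), the convention that a duration of 0 means an unanswered call, and the function names are all hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def call_features(records, user):
    """Compute a few call behavior features for one user.

    records: list of dicts with hypothetical keys "caller", "callee" and
    "duration" (seconds; 0 is assumed to mean the call was not answered).
    """
    outgoing = [r for r in records if r["caller"] == user]
    incoming = [r for r in records if r["callee"] == user]
    total = len(outgoing) + len(incoming)
    answered = [r["duration"] for r in outgoing if r["duration"] > 0]
    return {
        "total_calls": total,
        "outgoing_ratio": len(outgoing) / total if total else 0.0,
        "avg_outgoing_duration": sum(answered) / len(answered) if answered else 0.0,
        "unanswered_outgoing_ratio": (
            (len(outgoing) - len(answered)) / len(outgoing) if outgoing else 0.0
        ),
        "distinct_callees": len({r["callee"] for r in outgoing}),
    }


def location_entropy(cells):
    """Shannon entropy (in bits) of the distribution of observed locations."""
    counts = Counter(cells)
    n = len(cells)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


records = [
    {"caller": "u", "callee": "x", "duration": 60},
    {"caller": "u", "callee": "y", "duration": 0},
    {"caller": "u", "callee": "x", "duration": 30},
    {"caller": "z", "callee": "u", "duration": 10},
]
print(call_features(records, "u"))
print(location_entropy(["cell_A", "cell_A", "cell_B", "cell_B"]))
```

A number that calls many distinct callees with a high outgoing ratio and a low location entropy would stand out on these features, which is the kind of signal the basic features are meant to capture.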
Step two, constructing the graph network model Struct2Vec based on graph structure similarity from the user's second-degree call data, extracting the structural features of the user's second-degree network, and identifying fraud pattern structures such as the multi-point one-line structure. This comprises the following steps:
Step 1: constructing a graph from the second-degree call network, and obtaining the vertex pair distance of each layer for each vertex, wherein the layers take the vertex as the origin: the first-degree network is the first layer, the second-degree network is the second layer, and so on.
Vertex pair distance formula:
$$f_k(u, v) = f_{k-1}(u, v) + g\big(s(R_k(u)),\, s(R_k(v))\big), \quad k \ge 0 \ \text{and}\ |R_k(u)|, |R_k(v)| > 0$$
wherein $R_k(u)$ denotes the set of vertices at distance $k$ from vertex $u$, $R_k(v)$ denotes the set of vertices at distance $k$ from vertex $v$, and $s(R_k(u))$ denotes the ordered degree sequence of the vertex set $R_k(u)$, that is, the degrees of all vertices at distance $k$ from $u$, arranged in order.
$g(D_1, D_2) \ge 0$ is a function measuring the distance between the ordered degree sequences $D_1$ and $D_2$. Because $s(R_k(u))$ and $s(R_k(v))$ may differ in length and may contain repeated elements, Dynamic Time Warping (DTW) is used to measure the two ordered degree sequences, since DTW can measure the distance between two sequences of different lengths that contain repeated elements. Based on DTW, the distance function between elements is defined as:
Figure BDA0002927916690000091
the distance function defined in this way penalizes a difference between two degrees more heavily when the degrees of both vertices are small;
$f_k(u, v)$ denotes the structural distance between vertices $u$ and $v$ over their rings at distance $k$; because $f_{k-1}(u, v)$ is added at every step, the sum is accumulated iteratively and $f_k$ effectively covers the set of nodes at distance less than or equal to $k$. This is the vertex pair distance function.
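The recursion for the vertex pair distance can be sketched in a few lines of Python. The element distance is shown in this text only as a figure reference, so the sketch assumes the form used in the published struc2vec algorithm, max(a, b) / min(a, b) - 1; that form, the adjacency-dict graph encoding and all function names are assumptions made for illustration. It also assumes every vertex has degree at least 1, which holds for any number that appears in a call record.

```python
from collections import deque


def rings(adj, source, max_k):
    """Return [R_0, ..., R_max_k]: vertices at exact hop distance k from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_k:
            continue
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    layers = [[] for _ in range(max_k + 1)]
    for node, d in dist.items():
        layers[d].append(node)
    return layers


def degree_sequence(adj, vertices):
    """Ordered degree sequence s(R) of a vertex set."""
    return sorted(len(adj[v]) for v in vertices)


def element_cost(a, b):
    # Assumed form (struc2vec); requires degrees >= 1.
    return max(a, b) / min(a, b) - 1.0


def dtw(seq_a, seq_b):
    """Dynamic Time Warping distance between two non-empty sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = element_cost(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = cost + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[n][m]


def structural_distances(adj, u, v, max_k=2):
    """Return [f_0, f_1, ...], stopping once either ring R_k is empty."""
    rings_u, rings_v = rings(adj, u, max_k), rings(adj, v, max_k)
    result, total = [], 0.0
    for k in range(max_k + 1):
        if not rings_u[k] or not rings_v[k]:
            break
        total += dtw(
            degree_sequence(adj, rings_u[k]), degree_sequence(adj, rings_v[k])
        )
        result.append(total)
    return result
```

With `max_k=2` the computation stays within the second-degree network described in the text. Two hub numbers that each call the same number of otherwise isolated numbers get distance 0 at every layer, even if they share no contacts.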
Step 2: construction of weighted hierarchical graph according to vertex pair distance
For each k, a distance between two vertices can be calculated, and this step is mainly used to construct a hierarchical weighted graph through the ordered degree sequence distances between the vertices obtained above, for subsequent random walks.
Defining the edge weights of two vertices in a certain layer k as
Figure BDA0002927916690000092
The edge weights thus defined are all less than or equal to 1, and an edge weight is 1 if and only if the distance is 0.
The same vertexes belonging to different levels are connected through directed edges, namely, each vertex is connected with the corresponding same upper-layer vertex and lower-layer vertex.
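A minimal sketch of the weighted hierarchical graph follows. The edge weight formula appears in this text only as a figure reference, so the sketch assumes the struc2vec form, the exponential of the negative distance, which is consistent with the stated property that the weight is 1 exactly when the distance is 0. The weight of 1.0 on the edges that link a vertex to its own copy in the adjacent layer is a placeholder assumption, since the text does not state those weights; the data layout and names are hypothetical.

```python
import math


def build_hierarchical_graph(distances):
    """Build the layered weighted graph.

    distances: {k: {(u, v): f_k(u, v)}}, one entry per unordered vertex pair.
    Returns {(k, u): {(k2, v): weight}} as an adjacency dict.
    """
    graph = {}

    def add_edge(src, dst, weight):
        graph.setdefault(src, {})[dst] = weight

    layers = sorted(distances)
    # Same-layer edges, weighted by structural similarity.
    for k in layers:
        for (u, v), f in distances[k].items():
            weight = math.exp(-f)  # assumed form; equals 1 only when f == 0
            add_edge((k, u), (k, v), weight)
            add_edge((k, v), (k, u), weight)
    # Directed edges linking each vertex to itself in the layer above and below.
    for lower, upper in zip(layers, layers[1:]):
        in_lower = {x for pair in distances[lower] for x in pair}
        in_upper = {x for pair in distances[upper] for x in pair}
        for vertex in in_lower & in_upper:
            add_edge((lower, vertex), (upper, vertex), 1.0)  # placeholder weight
            add_edge((upper, vertex), (lower, vertex), 1.0)  # placeholder weight
    return graph
```

The resulting adjacency dict can be walked directly: from any node, the outgoing weights act as unnormalized transition preferences.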
And step 3: randomly walking sample vertex sequences in weighted hierarchy chart
And sampling vertex sequences in the weighted hierarchical graph based on a random walk mode. Firstly, selecting any point as a starting point, carrying out random walk to obtain a point sequence, then regarding the obtained sequence as a sentence, and learning by using word2vec to obtain a vertex expression embedded feature vector; and traversing the vertexes in the weighted hierarchical graph, and randomly walking to obtain the embedded feature vectors of all the vertexes.
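The sampling step can be sketched as below. The graph encoding (an adjacency dict keyed by `(layer, vertex)` pairs), the choice to start every walk in layer 0, the step bound and the parameter names are all assumptions for illustration; the text itself only states that walks are sampled and then treated as sentences.

```python
import random


def sample_walks(graph, walks_per_vertex=5, walk_length=10, seed=0):
    """Sample vertex sequences ("sentences") by weighted random walks.

    graph: {(layer, vertex): {(layer, vertex): weight}}, a hypothetical
    encoding of the weighted hierarchical graph.
    """
    rng = random.Random(seed)
    sentences = []
    for start in sorted(node for node in graph if node[0] == 0):
        for _walk in range(walks_per_vertex):
            node, walk = start, [start[1]]
            # Bound the number of steps so layer changes cannot loop forever.
            for _step in range(4 * walk_length):
                neighbors = graph.get(node)
                if not neighbors or len(walk) == walk_length:
                    break
                targets = list(neighbors)
                weights = [neighbors[t] for t in targets]
                nxt = rng.choices(targets, weights=weights, k=1)[0]
                if nxt[1] != node[1]:
                    walk.append(nxt[1])  # same-layer step to another vertex
                node = nxt
            sentences.append(walk)
    return sentences
```

The returned lists of vertex identifiers can then be passed as sentences to a word2vec implementation (for example gensim's `Word2Vec`) to learn one embedding vector per vertex.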
Step three: converting the user's call data into call time-series data, and constructing a time-series-based combination of call time-series features. The specific steps are as follows:
step 1: establishing the user's call time-series data from the calls and call durations occurring at each moment, the number of calls on each day within a set time period, and the intervals between the user's calls;
step 2: using the open-source Python package tsfresh, taking the constructed time-series data as input and outputting the set time-series features. The call time-series feature values input to the model comprise 64 features based on time-series change, including but not limited to: the sum of squares of the series, the sum of absolute values of consecutive changes, the approximate entropy of the series (which measures the periodicity, unpredictability and fluctuation of a time series), autoregressive model coefficients, the number of values greater (or smaller) than the mean, the maximum, the minimum, repeated values, the length of the longest consecutive subsequence greater (or smaller) than the mean, and the mean of the absolute values of consecutive changes.
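To make a few of the listed features concrete, the sketch below computes them by hand in plain Python for a single daily call-count series. It does not call tsfresh; the function names merely mirror the tsfresh feature calculators of the same names, and the example series is invented for illustration. Series are assumed to contain at least two points.

```python
def abs_energy(series):
    """Sum of squares of the series."""
    return sum(x * x for x in series)


def absolute_sum_of_changes(series):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(series, series[1:]))


def mean_abs_change(series):
    """Mean of absolute differences between consecutive values."""
    return absolute_sum_of_changes(series) / (len(series) - 1)


def count_above_mean(series):
    """Number of values strictly greater than the series mean."""
    mean = sum(series) / len(series)
    return sum(1 for x in series if x > mean)


def longest_strike_above_mean(series):
    """Length of the longest consecutive run of values above the mean."""
    mean = sum(series) / len(series)
    best = run = 0
    for x in series:
        run = run + 1 if x > mean else 0
        best = max(best, run)
    return best


def timing_features(series):
    """Collect the hand-computed features into one feature row."""
    return {
        "abs_energy": abs_energy(series),
        "absolute_sum_of_changes": absolute_sum_of_changes(series),
        "mean_abs_change": mean_abs_change(series),
        "count_above_mean": count_above_mean(series),
        "longest_strike_above_mean": longest_strike_above_mean(series),
        "maximum": max(series),
        "minimum": min(series),
    }
```

For a hypothetical number whose daily call counts over one week are `[2, 3, 40, 45, 50, 1, 0]`, the burst of three busy days shows up as a longest run above the mean of 3 and a large sum of absolute changes.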
Step four: balancing the sample data set using Borderline-SMOTE, an improved SMOTE oversampling method. The specific steps are as follows:
step 1: dividing the pre-training data feature set into a training set and a test set according to a preset proportion, fixing the random seed so that the test set remains unchanged for comparing results;
step 2: applying the Borderline-SMOTE oversampling technique to the training set, and dividing the minority fraud samples in the training set into 3 classes, namely Safe, Danger and Noise. A Safe sample is one for which more than half of the surrounding samples are minority samples; a Danger sample is one for which more than half of the surrounding samples are majority samples, and is regarded as a sample on the boundary; a Noise sample is one whose surrounding samples are all majority samples, and is regarded as noise. Only the minority samples of the Danger class are oversampled.
step 3: oversampling the minority samples of the Danger class, randomly selecting minority samples by the K-nearest-neighbor method to generate the new samples.
step 4: randomly changing the fixed test set, comparing the results, and selecting a suitable oversampling proportion and parameters.
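The Safe, Danger and Noise partition and the interpolation step can be sketched in plain Python as follows. This is an illustrative sketch rather than the patented implementation: the function names are hypothetical, Euclidean distance is assumed, and a sample with exactly half of its neighbours in the majority class is treated as Danger, which follows the original Borderline-SMOTE paper rather than anything stated in this document.

```python
import math
import random


def k_nearest(idx, points, k, candidates):
    """Indices among `candidates` of the k points closest to points[idx]."""
    order = sorted(
        (math.dist(points[idx], points[j]), j) for j in candidates if j != idx
    )
    return [j for _, j in order[:k]]


def label_minority(points, labels, minority, k):
    """Tag each minority sample as 'safe', 'danger' or 'noise'."""
    tags = {}
    everyone = range(len(points))
    for i in everyone:
        if labels[i] != minority:
            continue
        nbrs = k_nearest(i, points, k, everyone)
        majority = sum(1 for j in nbrs if labels[j] != minority)
        if majority == len(nbrs):
            tags[i] = "noise"      # surrounded only by majority samples
        elif 2 * majority >= len(nbrs):
            tags[i] = "danger"     # ASSUMPTION: ties count as borderline
        else:
            tags[i] = "safe"
    return tags


def oversample_danger(points, labels, minority, k, n_new, seed=0):
    """Create n_new synthetic minority points around the Danger samples only."""
    rng = random.Random(seed)
    tags = label_minority(points, labels, minority, k)
    danger = [i for i, tag in tags.items() if tag == "danger"]
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    synthetic = []
    while danger and len(synthetic) < n_new:
        i = rng.choice(danger)
        nbrs = k_nearest(i, points, k, minority_idx)
        if not nbrs:
            break  # no other minority sample to interpolate towards
        j = rng.choice(nbrs)
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(points[i], points[j]))
        )
    return synthetic
```

Each synthetic point lies on the segment between a Danger sample and one of its minority-class neighbours, so the added fraud samples concentrate near the decision boundary. The test set is left untouched; only the training set would be passed to these functions.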
Step five: constructing, based on the feature data, a plurality of different ensemble learning combinations including boosting and bagging, constructing a classification model for identifying normal, fraud and promotion numbers by weight assignment, and accurately identifying fraud numbers in combination with a black-and-white list mechanism. The specific steps are as follows:
step 1: constructing three binary classification models, namely normal versus fraud, normal versus promotion, and promotion versus fraud. Each binary classification model is trained by combining boosting-based and bagging-based ensemble learning algorithms, for example random forest for bagging and XGBoost, LightGBM, GBDT and AdaBoost for boosting, and the two kinds of ensemble learning are then integrated again. During training, the final result of each model is output as a probability.
step 2: combining the probabilities and selecting suitable weights by grid search to construct a three-class model. For example, let x, y and z denote the outputs of the selected models and a, b and c the corresponding weights, with a + b + c = 1. The optimal weight assignment is selected by grid search, and the classifier outputs are combined to construct the three-class identification model:
$W = ax + by + cz, \quad a + b + c = 1$
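The weight search can be sketched as a small grid search. Several details here are assumptions, since the text does not fix them: each model output x, y, z is taken to be a list of per-sample probability rows over the three classes, the grid step is 0.1, accuracy on a validation set is the selection criterion, and all function names are hypothetical.

```python
def weight_grid(step=0.1):
    """Yield non-negative (a, b, c) on a regular grid with a + b + c = 1."""
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            yield i / n, j / n, (n - i - j) / n


def fuse(x, y, z, weights):
    """Compute W = a*x + b*y + c*z row by row."""
    a, b, c = weights
    return [
        [a * p + b * q + c * r for p, q, r in zip(row_x, row_y, row_z)]
        for row_x, row_y, row_z in zip(x, y, z)
    ]


def accuracy(probs, labels):
    """Fraction of samples whose highest fused probability matches the label."""
    hits = sum(1 for row, lab in zip(probs, labels) if row.index(max(row)) == lab)
    return hits / len(labels)


def best_weights(x, y, z, labels, step=0.1):
    """Return the grid point with the highest accuracy (first one on ties)."""
    return max(weight_grid(step), key=lambda w: accuracy(fuse(x, y, z, w), labels))
```

With a step of 0.1 the grid contains 66 candidate weight triples, which is small enough to evaluate exhaustively; a finer step or a different metric, such as recall on the fraud class, could be substituted without changing the structure.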
step 3: performing whitelist filtering and blacklist matching against the black-and-white list, and identifying the remaining numbers that cannot be resolved in this way with the three-class identification model;
wherein the black-and-white list includes: identified and confirmed fraud phones and promotion phones, as well as reliably marked numbers of takeaway couriers, taxi drivers, DiDi drivers, registered companies and the like. The latter numbers also show a high call volume and a high call frequency, but they have a long life cycle, so once identified they can be filtered effectively through the black-and-white list. Promotion phones resemble fraud phones in having a high call frequency and a short life cycle, and are therefore identified as a separate class.
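The order of operations in this final step can be sketched as a small dispatch function. The data structures are assumptions made for illustration: the whitelist is a set of numbers, the blacklist maps a number to its confirmed label, and `classify` stands in for the three-class identification model.

```python
def identify(number, whitelist, blacklist, classify):
    """Label a number, consulting the lists before falling back to the model."""
    if number in whitelist:
        return "normal"            # e.g. a reliably marked courier or taxi driver
    if number in blacklist:
        return blacklist[number]   # previously confirmed "fraud" or "promotion"
    return classify(number)        # three-class model handles everything else
```

Checking the lists first means that long-lived, high-volume legitimate numbers never reach the model, which reduces false positives on exactly the numbers whose call patterns look most like fraud.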
Optionally, the present invention further provides a computer device, including but not limited to one or more processors and a memory, where the memory is used for storing computer executable programs, the processor reads part or all of the computer executable programs from the memory and executes the computer executable programs, and the processor can implement part or all of the steps of the multi-source feature-based fraud telephone recognition method of the present invention when executing part or all of the computer executable programs.
The fraud telephone identification device may be a laptop, a tablet computer, a desktop computer or a workstation.
The processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or a field-programmable gate array (FPGA).
The memory of the invention can be an internal storage unit of a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation, such as a memory and a hard disk; external memory units such as removable hard disks, flash memory cards may also be used.
Computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The random access memory may include a resistive random access memory (ReRAM).

Claims (10)

1. A multi-source feature-based fraud telephone identification method, characterized by comprising the following steps:
establishing second-degree call data, position data and internet surfing data of users of three categories including normal numbers, promotion numbers and fraud numbers, and extracting basic characteristics of the users based on the second-degree call data, the position data and the internet surfing data;
constructing a Struct2Vec graph network model based on graph structure similarity based on second-degree call data of a user, extracting structural features of a second-degree network of the user, and identifying a multi-point one-line fraud mode structure;
converting first-degree call data of a user into the user's call time-series data, and constructing a time-series-based combination of call time-series features according to the call time-series data;
combining the basic features of the user, the structural features of the user's second-degree network and the call time-series features, and constructing a feature sample data set by feature filtering and dimension reduction;
balancing the feature sample data set by using an oversampling method Borderline-SMOTE;
and constructing a black and white list mechanism, constructing a plurality of different ensemble learning combinations including boosting and bagging based on the balanced feature sample data set, constructing a normal, fraud and promotion identification fusion classification model by adopting a weight value distribution mode based on the ensemble learning combinations, and identifying fraud calls.
2. The multi-source feature-based fraud phone identification method of claim 1, wherein said user's basic features comprise a user's basic call feature, portrait information feature, location information feature and internet information feature.
3. The multi-source feature-based fraud telephone identification method of claim 1, wherein the structural features of the user's second-degree network are extracted based on the Struct2Vec graph network model of graph structure similarity, and the multi-point one-line fraud mode structure is identified, as follows:
constructing a graph by using a two-degree call network, and acquiring a vertex pair distance of each layer network based on each vertex of the network, wherein the layer takes the vertex as an origin, the first-degree network is a first layer, the second-degree network is a second layer, and the like;
the vertex pair distance $f_k(u, v)$ is:
$f_k(u, v) = f_{k-1}(u, v) + g(s(R_k(u)), s(R_k(v))), \quad k \ge 0 \text{ and } |R_k(u)|, |R_k(v)| > 0$
wherein $R_k(u)$ denotes the set of vertices at distance k from vertex u, $R_k(v)$ denotes the set of vertices at distance k from vertex v, and $s(R_k(u))$ denotes the ordered degree sequence of the vertex set $R_k(u)$, that is, all vertices at distance k from vertex u arranged in order of their degrees;
$g(D_1, D_2) \ge 0$ is a function measuring the distance between the ordered degree sequences $D_1$ and $D_2$; the distance between two ordered degree sequences is computed by Dynamic Time Warping, with the distance function between elements defined as:
Figure FDA0002927916680000021
$f_k(u, v)$ denotes the structural distance between vertices u and v when the rings up to distance k are considered; because each step adds a new term to $f_{k-1}(u, v)$, the value at k effectively accounts for all nodes at distances less than or equal to k, and the vertex-pair distance is accumulated iteratively;
calculating a distance between two vertexes for each k, and constructing a weighted hierarchical graph through the ordered degree sequence distance between the vertexes for subsequent random walk;
defining the edge weights of two vertices in a certain layer k as
Figure FDA0002927916680000022
the edge weights are all at most 1, and an edge weight equals 1 if and only if the distance is 0;
connecting the same vertexes belonging to different levels through directed edges, namely connecting each vertex with the corresponding same upper-layer vertex and lower-layer vertex to obtain a weighted level graph;
sampling vertex sequences from the weighted hierarchical graph by random walk: selecting each vertex in turn as a starting point, performing a random walk to obtain a vertex sequence, treating the sequence as a sentence, and learning with word2vec to obtain the embedded feature vector of each vertex; the embedded feature vector captures the structure of each vertex within its second-degree network and is used to identify the multi-point one-line fraud mode structure.
4. The multi-source feature-based fraud telephone identification method of claim 1, wherein the call data of the user is converted into call time-series data and the time-series-based combination of call time-series features is constructed as follows:
establishing the user's call time-series data from the calls and call durations occurring at each moment, the number of calls on each day within a set time period, and the intervals between the user's calls, and extracting time-series features from the time-series data; the constructed time-series data is input to the open-source Python package tsfresh, which outputs the set time-series features.
5. The multi-source feature-based fraud phone identification method of claim 1, wherein balancing said feature sample dataset with a modified SMOTE oversampling method Borderline-SMOTE is specifically as follows:
dividing the characteristic sample data set into a training set and a test set according to a preset proportion, wherein the test set is unchanged;
applying the Borderline-SMOTE oversampling technique to the training set, and dividing the minority fraud samples in the training set into 3 classes, namely Safe, Danger and Noise, wherein a Safe sample is one for which more than half of the surrounding samples are minority samples, a Danger sample is one for which more than half of the surrounding samples are majority samples and is regarded as a sample on the boundary, and a Noise sample is one whose surrounding samples are all majority samples and is regarded as noise;
oversampling the minority samples of the Danger class, randomly selecting minority samples by the K-nearest-neighbor method to generate the new samples.
6. The multi-source feature-based fraud phone identification method of claim 1, wherein a plurality of different ensemble learning combinations including boosting and bagging are constructed based on multi-source feature data, and a classification model of normal, fraud and promotion identification is constructed by means of weight assignment, and in combination with a black-and-white list mechanism, fraud numbers are identified as follows:
constructing three binary classification models, namely normal versus fraud, normal versus promotion, and promotion versus fraud, wherein each binary classification model is trained by combining boosting-based and bagging-based ensemble learning algorithms and outputs its final result as a probability; different models are selected for integration so as to combine the advantages of the boosting and bagging ensemble learning algorithms;
combining the probabilities, and performing weight matching by a grid search method to construct a three-classification model;
and filtering a white list and matching a black list in the black and white list, and identifying the remaining unidentifiable numbers by adopting a three-classification identification model.
7. The multi-source feature-based fraud telephone identification method of claim 1, wherein the black-and-white list comprises: identified and confirmed fraud phones and promotion phones, and reliably marked numbers of takeaway couriers, taxi drivers, DiDi drivers and registered companies, with promotion phones and fraud phones being identified as a class.
8. A multi-source feature-based fraud telephone identification system, characterized by comprising a basic feature extraction module, a fraud mode structure identification module, a time-series feature combination construction module, a feature sample data set construction module, a feature sample data set balancing module and a fusion classification identification module;
the basic feature extraction module is used for constructing second-degree call data, position data and internet surfing data of users in three categories including normal numbers, sales promotion numbers and fraud numbers, and extracting basic features of the users based on the second-degree call data, the position data and the internet surfing data;
the fraud mode structure identification module constructs a Struct2Vec graph network model based on graph structure similarity based on second-degree conversation data of a user, extracts the structural characteristics of a second-degree network of the user and identifies a multi-point one-line fraud mode structure;
the time-series feature combination construction module is used for converting first-degree call data of a user into the user's call time-series data, and constructing a time-series-based combination of call time-series features according to the call time-series data;
the feature sample data set construction module is used for combining the basic features of the user, the structural features of the user's second-degree network and the call time-series features, and constructing a feature sample data set by feature filtering and dimension reduction;
the feature sample data set balancing module is used for balancing the feature sample data set by using the oversampling method Borderline-SMOTE;
the fusion classification identification module is used for constructing a black-and-white list mechanism, constructing a plurality of different integrated learning combinations including boosting and bagging based on the balanced feature sample data set, constructing a normal, fraud and promotion identification fusion classification model based on the integrated learning combinations and by adopting a weight value distribution mode, and identifying fraud calls.
9. A computer device, comprising one or more processors and a memory, wherein the memory is used for storing computer executable programs, the processors read part or all of the computer executable programs from the memory and execute the computer executable programs, and the processors can realize the multi-source feature-based fraud telephone recognition method according to any one of claims 1-7 when executing part or all of the computer executable programs.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when being executed by a processor, the computer program can implement the multi-source feature-based fraud telephone identification method of any one of claims 1-7.
CN202110138462.XA 2021-02-01 2021-02-01 Multi-source feature-based fraud telephone identification method, system and equipment Active CN112866486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110138462.XA CN112866486B (en) 2021-02-01 2021-02-01 Multi-source feature-based fraud telephone identification method, system and equipment

Publications (2)

Publication Number Publication Date
CN112866486A true CN112866486A (en) 2021-05-28
CN112866486B CN112866486B (en) 2022-06-07

Family

ID=75987559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110138462.XA Active CN112866486B (en) 2021-02-01 2021-02-01 Multi-source feature-based fraud telephone identification method, system and equipment

Country Status (1)

Country Link
CN (1) CN112866486B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060149674A1 (en) * 2004-12-30 2006-07-06 Mike Cook System and method for identity-based fraud detection for transactions using a plurality of historical identity records
CN110147430A (en) * 2019-04-25 2019-08-20 上海欣方智能系统有限公司 Harassing call recognition methods and system based on random forests algorithm
CN111222025A (en) * 2019-12-27 2020-06-02 南京中新赛克科技有限责任公司 Fraud number identification method and system based on convolutional neural network
CN111726460A (en) * 2020-06-15 2020-09-29 国家计算机网络与信息安全管理中心 Fraud number identification method based on space-time diagram
CN112199388A (en) * 2020-09-02 2021-01-08 卓望数码技术(深圳)有限公司 Strange call identification method and device, electronic equipment and storage medium
CN112291424A (en) * 2020-10-29 2021-01-29 上海观安信息技术股份有限公司 Fraud number identification method and device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037460A (en) * 2021-11-25 2022-02-11 深圳安巽科技有限公司 Comprehensive anti-fraud platform, method and storage medium
CN114828013A (en) * 2022-06-27 2022-07-29 北京芯盾时代科技有限公司 Fraud number recognition and model training method thereof, related equipment and storage medium
CN114828013B (en) * 2022-06-27 2022-10-28 北京芯盾时代科技有限公司 Fraud number recognition and model training method thereof, related equipment and storage medium

Also Published As

Publication number Publication date
CN112866486B (en) 2022-06-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant