CN109451182B - Detection method and device for fraud telephone - Google Patents

Publication number
CN109451182B
CN109451182B CN201811219800.7A CN201811219800A CN109451182B CN 109451182 B CN109451182 B CN 109451182B CN 201811219800 A CN201811219800 A CN 201811219800A CN 109451182 B CN109451182 B CN 109451182B
Authority
CN
China
Prior art keywords
fraud
text
calls
call
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811219800.7A
Other languages
Chinese (zh)
Other versions
CN109451182A (en)
Inventor
林荣恒
张震
彭潞
闵星
吴步丹
邹华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Original Assignee
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Application filed by Beijing University of Posts and Telecommunications and National Computer Network and Information Security Management Center
Priority to CN201811219800.7A
Publication of CN109451182A
Application granted
Publication of CN109451182B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2281 Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12 Detection or prevention of fraud
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2203/00 Aspects of automatic or semi-automatic exchanges
    • H04M2203/60 Aspects of automatic or semi-automatic exchanges related to security aspects in telephonic communication systems
    • H04M2203/6027 Fraud preventions

Abstract

The application discloses a method for detecting fraud calls, which comprises the following steps: converting all call voices into texts to form a text set; converting each text in the text set into a keyword weight vector; forming a plurality of clusters from all the keyword weight vectors through text clustering, and determining whether each cluster is a fraud cluster according to a fraud keyword set; determining the calls corresponding to all the keyword weight vectors in a fraud cluster to be fraud calls; constructing a text social network using all calls and the keywords, marking the nodes corresponding to the fraud calls as fraud calls in the text social network, and determining other nodes to be marked as fraud calls through label propagation; and determining all calls corresponding to nodes marked as fraud calls to be fraud calls. The method and device are applicable to various fraud types, do not need to acquire users' sensitive data, and are therefore more practical to operate.

Description

Detection method and device for fraud telephone
Technical Field
The present application relates to the field of complex networks and mobile communication technologies, and in particular, to a method and an apparatus for detecting fraudulent calls.
Background
As the communication industry continues to develop, it brings more convenience, but telecom fraud is also rampant; telephone fraud techniques keep multiplying and are hard to guard against.
Currently, fraud call detection mainly relies on call-origin detection, blacklist interception and similar methods, which generally suffer from poor real-time performance and poor flexibility. Whenever fraud techniques are updated, the original interception means can easily fail. At present, frequent fraud calls mainly target mobile phone users, most calling numbers come from abroad, and fraudsters bypass existing interception means through number-changing software or VoIP technology.
Research shows that the methods used in fraud calls are often similar and fall into a few broad categories. Fraud groups also have a certain organizational structure, and fraud proceeds in stages. In the first stage, fraud members dial widely, "casting a wide net", so the calls show obvious behavioral characteristics such as high calling frequency, high called-number dispersion, long average calling time and low call completion rate; this is mainly because criminals use a calling platform to scan numbers in bulk in search of potential victims. Once a potential victim is found and the fraud enters the next stage, the behavioral characteristics of the fraud calls come closer to those of normal calls and are difficult to mine through calling features alone. However, the contents of fraud calls at this stage are similar to one another and often involve sensitive keywords such as "account transfer".
Some methods for identifying fraud calls already exist, but each has problems, for example:
Method one: collect call ticket data, analyze it to obtain a blacklist, make one-way recordings of blacklisted calls, and compare the recording files with a fraud voice sample library to determine whether a call is fraudulent. The main drawback is that fraud techniques change constantly, so building a voice library that covers the whole network is difficult and extremely resource-intensive;
Method two: extract the number characteristics and/or behavior characteristics of real-time call tickets, and analyze them with a preset fraud call identification model to determine whether the call behavior corresponding to a real-time call ticket is a fraud call. The main drawback is that the method only checks whether extracted behavior characteristics such as calling frequency and called-number dispersion match the fraud call identification model, so it can only find numbers whose behavior differs greatly from normal calling; for fraud calls whose behavior pattern is close to normal calling, the false interception rate is high, and evolving the identification model becomes harder as fraud techniques keep changing;
Method three: acquire abnormal behavior data and characteristic data of telephone numbers from the original call tickets, where the abnormal behavior data includes one or more of the number of calls to abnormal numbers, to vacant numbers and to unfamiliar numbers, and the characteristic data includes activity level and call data; input both kinds of data into a trained fraud telephone number analysis model, and obtain the fraud telephone number analysis result through a weighted naive Bayes classification algorithm.
A complex network is an abstraction of a complex system; in reality, many complex systems can be described and analyzed using the characteristics of complex networks. Complex networks have long been a research hotspot in many fields, and schemes that use complex networks to identify fraud calls have been proposed, such as methods four and five below.
Method four: abstract the individuals in a complex social network into vertices and each relationship between individuals into an edge, assign each edge a weight according to the strength of the relationship, build an adjacency matrix, and delimit fraud groups by clustering on the relationships of the vertices corresponding to users. After the fraudsters or defaulters in a fraud group are identified, the fraud risk or credit risk of the other users in the social network is recalculated. However, this method needs users' personal and social information, which is often sensitive and hard to obtain, and relying only on whatever data is available can cause large errors.
Method five: obtain test source data through a social graph, test the system under test with that data to generate a prediction model, and perform the operation through a social-network-based fraud group detection technique, where the test source data includes information such as the user-authorized address book, call records, short message records and emergency contacts. The drawbacks are that sensitive information such as the address book must be collected, and that the method mainly targets credit fraud and is of little practical use for fraud calls.
As can be seen from the above, many existing fraud call detection methods cannot adapt to different fraud types and techniques, and methods that identify fraud calls using a complex network often need to acquire users' sensitive information, which makes them hard to operate in practice.
Disclosure of Invention
The application provides a fraud phone detection method and device, which can be suitable for various fraud types, do not need to acquire user sensitive data, and have stronger operability.
In order to achieve the purpose, the following technical scheme is adopted in the application:
a method of detecting fraudulent calls, comprising:
converting all call voices into texts to form a text set; converting each text in the text set into a keyword weight vector;
forming a plurality of clusters for all the keyword weight vectors through text clustering, and determining whether each cluster is a fraud cluster according to the fraud keyword set; determining the calls corresponding to all the keyword weight vectors in the fraud cluster as fraud calls;
constructing a text social network by using all calls and the keywords, marking nodes corresponding to the fraud calls as fraud calls in the text social network, and determining other nodes marked as fraud calls through label propagation; all calls corresponding to nodes marked as fraudulent calls are determined to be fraudulent calls.
Preferably, the method further comprises:
constructing a call ticket social network by using all calling numbers of all calls in the fraud cluster and all numbers which have call relations with the calling numbers, and performing community discovery; determining that the corresponding community is a fraud community or a non-fraud community according to the number of fraud numbers included in each community in the ticket social network; converting all call voices in the fraud community into texts, then performing text clustering, extracting new keywords, adding the keywords into the fraud keyword set, and using the keywords in the process of converting all call voice texts into keyword weight vectors next time;
wherein the fraud number is a calling number in the fraud call.
Preferably, said determining whether each cluster is a fraudulent cluster comprises: for each cluster, comparing the characteristics of the cluster with the characteristics of preset fraud clusters according to the fraud keyword set, and determining whether the corresponding cluster is a fraud cluster.
Preferably, said comparing the characteristics of the cluster with the characteristics of the preset fraudulent clusters to determine whether the respective cluster is a fraudulent cluster comprises:
selecting words included in the fraud keyword set from all keyword vectors as fraud keywords;
calculating the sum x of the weights of all fraud keywords in the cluster, calculating the ratio of x to the sum of the weights of all keywords in the cluster, and if the ratio is greater than a preset threshold, determining that the cluster is a fraud cluster.
Preferably, the conversion of each text in the text set into a keyword weight vector is performed in a TF-IDF manner;
the constructing of the text social network by using all calls and the keywords comprises: and taking the text converted from all call voices and the keywords in all the keyword weight vectors as nodes of the text social network, if the text comprises a keyword, adding an edge between the corresponding text node and the keyword node, wherein the weight of the edge is the TF-IDF value of the corresponding keyword in the corresponding text.
Preferably, the constructing of the call ticket social network includes:
and taking the calling numbers of all calls in the fraud cluster and all numbers which have a call relation with the calling numbers as nodes of the call bill social network, if one call exists between any two nodes, adding an edge between the corresponding nodes, and setting the weight of the corresponding edge according to the characteristics of the call corresponding to each edge.
Preferably, the setting the weight of the corresponding edge according to the feature of the call corresponding to each edge includes: determining the weight of a corresponding edge according to the comprehensive call duration of the call and the attribution of the calling number and the called number; the longer the comprehensive call duration is, the greater the weight of the side is, the more similar the attribution of the calling and called numbers are, and the greater the weight of the side is.
A device for detecting fraudulent calls, comprising: the system comprises a ticket preprocessing unit, a voice recognition unit, a text clustering unit and a text community finding unit;
the call ticket preprocessing unit is used for collecting all call voices and performing data preprocessing operation;
the voice recognition unit is used for converting the conversation voice preprocessed by the ticket preprocessing unit into a text to form a text set;
the text clustering unit is used for converting each text in the text set into a keyword weight vector; forming a plurality of clusters for all the keyword weight vectors through text clustering, and determining whether each cluster is a fraud cluster according to the fraud keyword set; determining the calls corresponding to all the keyword weight vectors in the fraud cluster as fraud calls;
the text community discovery unit is used for constructing a text social network by utilizing all calls and the keywords, marking nodes corresponding to the fraud calls as fraud calls in the text social network, and determining other nodes marked as fraud calls through label propagation; all calls corresponding to nodes marked as fraudulent calls are determined to be fraudulent calls.
Preferably, the device further comprises a ticket community discovery unit, configured to construct a ticket social network by using all calling numbers of all calls in the fraud cluster and all numbers having a call relationship with the calling numbers, and perform community discovery; determining that the corresponding community is a fraud community or a non-fraud community according to the number of fraud numbers included in each community in the ticket social network; and converting all call voices in the fraud community into texts, then performing text clustering, extracting new keywords, adding the keywords into the fraud keyword set, and using the keywords in the process of converting all call voice texts into keyword weight vectors next time.
According to the above technical scheme, all call voices are converted into texts to form a text set; each text in the text set is converted into a keyword weight vector according to the fraud keyword set; a plurality of clusters is formed from all the keyword weight vectors through text clustering, and each cluster is determined to be a fraud cluster or not; the calls corresponding to all the keyword weight vectors in a fraud cluster are determined to be fraud calls; a text social network is constructed using all calls in the fraud cluster and the keywords, the nodes corresponding to the fraud calls are marked as fraud calls in the text social network, and other nodes to be marked as fraud calls are determined through label propagation; all calls corresponding to nodes marked as fraud calls are determined to be fraud calls. Because the method uses unsupervised clustering and complex network analysis, it is applicable to various fraud types and does not need to acquire users' sensitive data, so it is more practical to operate.
Drawings
FIG. 1 is a flow chart illustrating a fraud call detection method according to the present application;
FIG. 2 is a schematic structural diagram of a fraud call detection apparatus according to the present application.
Detailed Description
For the purpose of making the objects, technical means and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings.
Nodes in a complex network represent individuals in a system and edges represent relationships between individuals; examples include social relationship networks, food chains, the World Wide Web, urban transportation networks and power grids. Community structure is a common feature of complex networks: communities reflect the local characteristics of individual behaviors in the network and the associations among them, and the whole network is composed of a number of communities. Community discovery is a complex and meaningful process that plays an important role in studying the characteristics of complex networks. In recent years, discovering and analyzing community structure in complex networks has attracted the attention of many scholars, and many community discovery algorithms have appeared. A community in the traditional sense is a group of nodes in a network with high mutual similarity, forming a structure that is densely connected internally and sparsely connected externally: connections between nodes in the same community are dense, while connections between communities are sparse.
When judging whether a single call is a fraud call, relying only on the call ticket and content of that one call is very limiting; the call needs to be analyzed together with all other calls on the same day. That is, because of sampling, it may be impossible to tell from one particular call whether a fraud caller is committing fraud; but if other calls from the same caller are detected as fraud, that caller's calls are judged to be fraud calls, and early-warning interception is carried out.
The present application therefore proposes an unsupervised learning method based on complex networks. In its basic form, clustering is performed on the content (voice) data, a complex network is built, and community discovery is carried out on that network to recognize fraud calls. Further, a complex network can be built from the clustering result of the content data together with call behavior (call tickets), and community discovery on that network finds fraud and non-fraud communities so that fraud calls can be intercepted.
FIG. 1 is a flow chart of a fraud call detection method in the present application. As shown in fig. 1, the method includes:
Step 101, converting all call voices into texts to form a text set.
Step 102, each text in the text set is converted into a keyword weight vector.
For each call voice text, stop words are filtered out and the text is modeled as a word weight vector using standard TF-IDF text processing. Each word in the vector has a corresponding weight. Modeling the voice text as a word weight vector may be done in any existing manner, for example by TF-IDF conversion, in which case the weight of a keyword is its TF-IDF value. After conversion, each call voice text corresponds to one word weight vector. The stop words come from a stop word list. Some commonly used stop word lists exist, such as that of the Chinese academy, and it is preferable to add new stop words to an existing list based on the practical fraud context.
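As an illustrative, non-limiting sketch of step 102, the conversion could look like the following. The stop word list, the whitespace tokenizer (a stand-in for real word segmentation) and the "+1" smoothing of the IDF term are assumptions made for the example and are not specified by the application.

```python
import math
from collections import Counter

# Hypothetical stop word list; the application suggests extending an existing list.
STOP_WORDS = {"the", "a", "to", "is", "your", "please"}

def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real word segmentation)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(texts):
    """Return one {keyword: weight} dict per call text."""
    docs = [tokenize(t) for t in texts]
    n_docs = len(docs)
    # Document frequency: number of texts that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for word, count in counts.items():
            tf = count / len(doc)
            # Assumed smoothing: +1 keeps words shared by every text above zero.
            idf = math.log(n_docs / df[word]) + 1.0
            vec[word] = tf * idf
        vectors.append(vec)
    return vectors

texts = [
    "please transfer money to the safe account",
    "your bank account is frozen transfer money",
    "dinner tonight at seven",
]
vectors = tfidf_vectors(texts)
```

In this sketch, a word that appears in only one text (such as "safe") receives a larger weight than a word shared by several texts (such as "transfer").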
Step 103, forming a plurality of clusters from all the keyword weight vectors through text clustering, and determining whether each cluster is a fraud cluster according to a fraud keyword set; and determining the calls corresponding to all the keyword weight vectors in the fraud cluster as fraud calls.
For the keyword weight vectors, individual clusters are formed by a text clustering method, and any existing method can be adopted for the specific text clustering method, which is not limited in the application. Each cluster is made up of one or more keyword weight vectors, each corresponding to one call voice text as previously described, and thus each cluster may correspond to one or more call voice texts.
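Since the application leaves the clustering method open, the following is only one possible sketch: a single-pass clustering that assigns each vector to the first cluster whose seed vector is similar enough under cosine similarity. The algorithm choice and the `threshold` value are assumptions made for brevity, not the method of the application.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {keyword: weight} vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def single_pass_cluster(vectors, threshold=0.3):
    """Return clusters as lists of indices into `vectors`.

    The first index in each cluster is its seed; later vectors join the
    first cluster whose seed they resemble, or start a new cluster.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

vectors = [
    {"transfer": 0.5, "account": 0.4},
    {"transfer": 0.6, "police": 0.3},
    {"dinner": 0.7, "tonight": 0.5},
]
clusters = single_pass_cluster(vectors)
```

Here the two texts that share the keyword "transfer" fall into one cluster and the unrelated text forms its own cluster.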
Whether a given cluster is a fraud cluster is determined according to the fraud keyword set. The fraud keyword set here may be a pre-established set (e.g., a word set formed from prior knowledge, review of relevant documents, etc.), or a new fraud keyword set formed by adding new keywords to the original set through step 107. Preferably, the characteristics of the cluster are compared with preset characteristics of fraud clusters according to the fraud keyword set to determine whether the cluster is a fraud cluster. Specifically, if the ratio of the sum of the weights (e.g., TF-IDF values) of all fraud keywords in the cluster to the sum of the weights (e.g., TF-IDF values) of all keywords in the cluster exceeds a preset threshold, the cluster is determined to be a fraud cluster.
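The ratio test described above can be sketched as follows. The function name, the example keyword set and the threshold of 0.5 are hypothetical; the application only states that a preset threshold is used.

```python
def is_fraud_cluster(cluster_vectors, fraud_keywords, threshold=0.5):
    """Return True if fraud keywords carry more than `threshold` of the cluster's total weight."""
    fraud_weight = 0.0
    total_weight = 0.0
    for vec in cluster_vectors:
        for word, weight in vec.items():
            total_weight += weight
            if word in fraud_keywords:
                fraud_weight += weight
    if total_weight == 0.0:
        return False  # an empty cluster cannot be judged a fraud cluster
    return fraud_weight / total_weight > threshold

fraud_keywords = {"transfer", "account", "police"}  # illustrative prior keyword set
cluster_a = [
    {"transfer": 0.5, "account": 0.4, "today": 0.1},
    {"transfer": 0.6, "police": 0.3, "hello": 0.1},
]
cluster_b = [{"dinner": 0.7, "tonight": 0.5}]

a_is_fraud = is_fraud_cluster(cluster_a, fraud_keywords)
b_is_fraud = is_fraud_cluster(cluster_b, fraud_keywords)
```

In this example fraud keywords account for about 90% of the weight in the first cluster and none of the weight in the second.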
After the fraud clusters are distinguished, all call voice texts corresponding to the fraud clusters are considered fraud calls. At this point, a first portion of the fraud calls has been confirmed.
Step 104, constructing a text social network by using all calls and all keywords, marking nodes corresponding to the fraud calls as fraud calls in the text social network, and determining other nodes marked as fraud calls through label propagation; all calls corresponding to nodes marked as fraudulent calls are determined to be fraudulent calls.
If fraud detection is performed by only relying on text clustering, it is very susceptible to the set of initial fraud keywords. Therefore, there is also a need for establishing a social network for fraud identification. The specific method comprises the following steps: modeling each keyword in all the keyword weight vectors in the step 103 as a node in the network, modeling the call text corresponding to each keyword weight vector in the step 103 as a node in the network, if the text contains a certain keyword, adding an edge between the corresponding text node and the keyword node, and setting the weight of the edge (when the keyword weight vector conversion is performed through TF-IDF, the weight of the edge can be the TF-IDF value of the keyword in the call voice text), thus completing the establishment of the social network. In the established network, the nodes of the fraud calls are marked as fraud calls according to the text clustering result, label propagation is carried out, and finally the nodes marked as fraud are judged as fraud calls. The label propagation may adopt various existing label propagation algorithms, which is not limited in this application.
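A minimal sketch of the text social network and of one possible propagation scheme is given below. The application does not fix the label propagation algorithm; the damped, seed-clamped score propagation used here, together with the `damping`, `iterations` and `threshold` parameters, is an assumption chosen so that the example converges and stays short.

```python
from collections import defaultdict

def build_text_network(vectors):
    """Bipartite graph: text node ("t", i) <-> keyword node ("k", word); edge weight = TF-IDF value."""
    graph = defaultdict(dict)
    for i, vec in enumerate(vectors):
        for word, weight in vec.items():
            graph[("t", i)][("k", word)] = weight
            graph[("k", word)][("t", i)] = weight
    return graph

def propagate_fraud(graph, seed_texts, damping=0.9, iterations=100, threshold=0.5):
    """Spread a fraud score outward from seed texts; return indices of texts judged fraudulent."""
    seeds = {("t", i) for i in seed_texts}
    score = {node: (1.0 if node in seeds else 0.0) for node in graph}
    for _ in range(iterations):
        new_score = {}
        for node, neighbours in graph.items():
            if node in seeds:
                new_score[node] = 1.0  # seed labels stay clamped
                continue
            total = sum(neighbours.values())
            avg = sum(w * score[nb] for nb, w in neighbours.items()) / total
            new_score[node] = damping * avg  # damping makes the score decay with distance
        score = new_score
    found = {node[1] for node, s in score.items() if node[0] == "t" and s > threshold}
    return found | set(seed_texts)

vectors = [
    {"transfer": 0.5, "account": 0.4},  # text 0: already marked as fraud by clustering
    {"transfer": 0.6, "account": 0.3},  # text 1: shares keywords with text 0
    {"dinner": 0.7, "tonight": 0.5},
    {"dinner": 0.6, "movie": 0.4},
]
fraud_texts = propagate_fraud(build_text_network(vectors), seed_texts={0})
```

In this example text 1 is pulled into the fraud set because it is linked to the seed through shared keywords, while texts 2 and 3 are not connected to any seed and keep a score of zero.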
This completes the most basic fraud call detection method of the present application. The text social network picks out more fraud calls than text clustering alone, which improves fraud recall. However, because text-based fraud detection contains no call behavior information, some fraud calls are still missed. To further identify fraud communities and groups, the following steps can preferably also be executed: a call ticket social network is constructed from the call ticket records, community discovery is performed, and further fraud is recalled on the basis of the initial fraud samples.
Step 105, constructing a call ticket social network by using the calling numbers of all calls in the fraud cluster and all numbers having call relations with the calling numbers.
The call ticket social network is established from the call tickets as follows: determine the calling numbers of all calls in the fraud cluster and all numbers that have a call relation with those calling numbers; model all the determined numbers as nodes in the network; if one call exists between two number nodes, add one edge between them, and if there are multiple calls, add multiple edges; set the weight of each edge according to the characteristics of its corresponding call. This completes the network. Preferably, the edge weight may be determined by combining features such as the call duration and the home locations of the calling and called numbers. The longer the call, the higher the probability that the fraud succeeds; likewise, the closer the home locations of the two numbers, the higher the success rate (fraud callers very often impersonate the local public security bureau or a bank to deceive victims, and fraud within the same province or the same city succeeds more often). Therefore, the longer the call duration, the larger the edge weight, and the more similar the number home locations (e.g., the same province or the same city), the larger the edge weight can be set.
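The construction above can be sketched as follows. The application only requires that the edge weight grow with call duration and with home-location similarity; the concrete formula (minutes multiplied by one plus a similarity score), the similarity values and the ticket layout `(caller, callee, duration in seconds)` are assumptions made for illustration.

```python
from collections import defaultdict

def location_similarity(loc_a, loc_b):
    """loc = (province, city). Assumed scoring: same city 1.0, same province 0.5, otherwise 0.0."""
    if loc_a == loc_b:
        return 1.0
    if loc_a[0] == loc_b[0]:
        return 0.5
    return 0.0

def edge_weight(duration_s, loc_caller, loc_callee):
    """Assumed formula: longer calls and closer home locations both increase the weight."""
    return (duration_s / 60.0) * (1.0 + location_similarity(loc_caller, loc_callee))

def build_ticket_network(tickets, fraud_callers, home):
    """Return (nodes, edges); edges maps frozenset({a, b}) to one weight per call (a multigraph)."""
    # Nodes: fraud calling numbers plus every number that has a call relation with them.
    nodes = set(fraud_callers)
    for caller, callee, _ in tickets:
        if caller in fraud_callers:
            nodes.add(callee)
        elif callee in fraud_callers:
            nodes.add(caller)
    # Edges: one per call between any two nodes of the network.
    edges = defaultdict(list)
    for caller, callee, duration_s in tickets:
        if caller in nodes and callee in nodes:
            weight = edge_weight(duration_s, home[caller], home[callee])
            edges[frozenset((caller, callee))].append(weight)
    return nodes, edges

home = {
    "A": ("Beijing", "Beijing"),
    "B": ("Beijing", "Beijing"),
    "C": ("Hebei", "Baoding"),
    "D": ("Hebei", "Tangshan"),
}
tickets = [("A", "B", 120), ("A", "B", 60), ("A", "C", 60), ("C", "D", 300)]
nodes, edges = build_ticket_network(tickets, fraud_callers={"A"}, home=home)
```

In this example the two calls between A and B yield two edges with weights 4.0 and 2.0, the call between A and C yields one edge with weight 1.0, and the call between C and D is left out because D has no call relation with a fraud calling number.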
Step 106, carrying out community discovery in the established call ticket social network, and determining that the corresponding community is a fraud community or a non-fraud community according to the number of fraud numbers included in each community.
Community discovery is performed on the established call ticket social network. Any existing community discovery algorithm may be used; the application does not limit it. Communities are marked as fraud or non-fraud communities according to the fraud calls determined in step 103 and the community discovery result of this step. For example, when the ratio of the number of fraud numbers in a community to the total number of numbers in the community exceeds a set threshold, the community can be determined to be a fraud community. After community discovery and classification, a community usually corresponds to one group and the call relations within it form a fraud chain; in practical tests, what is found in most cases is a relation between several calling parties and one called party.
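The classification rule in the example above can be sketched as follows. Because the community discovery algorithm is left open, connected components are used here purely as a simple stand-in for it; the threshold of 0.3 is likewise a hypothetical value.

```python
def connected_components(nodes, edges):
    """Stand-in community discovery: each connected component is treated as one community."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, communities = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        communities.append(component)
    return communities

def classify_communities(communities, fraud_numbers, threshold=0.3):
    """Label a community "fraud" when its share of fraud numbers exceeds the threshold."""
    result = []
    for community in communities:
        ratio = len(community & fraud_numbers) / len(community)
        result.append((community, "fraud" if ratio > threshold else "non-fraud"))
    return result

nodes = {"A", "B", "C", "W", "X", "Y", "Z"}
edges = [("A", "B"), ("A", "C"), ("X", "Y"), ("Y", "Z"), ("Z", "W")]
communities = connected_components(nodes, edges)
labelled = classify_communities(communities, fraud_numbers={"A", "X"})
labels = [label for _, label in labelled]
```

In this example the community {A, B, C} contains one fraud number out of three and is labelled a fraud community, while {W, X, Y, Z} contains one out of four and falls below the assumed threshold.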
Step 107, converting all call voices in the fraud community into texts, performing text clustering, extracting new keywords, adding the extracted keywords into the fraud keyword set used in step 103, and using them in the next fraud cluster judgment process.
After the fraud community is found through step 106, the text data corresponding to the fraud calls in the fraud community can be screened with a set of natural language processing rules to remove words that are unlikely to become keywords (including stop words, auxiliary words and colloquial filler words); new keywords are then generated and added to the current prior fraud keyword set (i.e., the fraud keyword set used in step 103 of the current cycle). When new keywords are added, keywords already in the fraud keyword set need not be added again. Through repeated cycles, the fraud keyword set is iteratively updated, which optimizes the model and further improves the recall and accuracy of fraud detection.
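One cycle of the keyword update can be sketched as follows. The application does not specify the natural language processing rules or how new keywords are ranked; the filter list `NON_KEYWORDS`, the frequency-based ranking and the `top_n` parameter are assumptions made for the example.

```python
from collections import Counter

# Hypothetical filter list standing in for the natural language processing rules.
NON_KEYWORDS = {"the", "a", "to", "um", "uh", "please", "your", "is", "for"}

def extract_new_keywords(fraud_texts, fraud_keywords, top_n=3):
    """Return the top_n most frequent candidate words not already in the fraud keyword set."""
    counts = Counter(
        word
        for text in fraud_texts
        for word in text.lower().split()
        if word not in NON_KEYWORDS and word not in fraud_keywords
    )
    return [word for word, _ in counts.most_common(top_n)]

def update_fraud_keywords(fraud_texts, fraud_keywords, top_n=3):
    """Return a new set: the old keywords plus newly extracted ones, without duplicates."""
    new_words = extract_new_keywords(fraud_texts, fraud_keywords, top_n)
    return set(fraud_keywords) | set(new_words)

fraud_texts = [
    "please transfer to the safe account",
    "um your safe account is frozen",
    "verification code for the safe account",
]
fraud_keywords = {"transfer", "account"}
updated = update_fraud_keywords(fraud_texts, fraud_keywords, top_n=1)
```

Here "safe" appears in all three fraud texts and is added, so the updated set can be used in step 103 of the next cycle; the original set is left unmodified.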
So far, the flow of the fraud call detection method in the application is ended.
The above is the specific implementation of the fraud call detection method in the present application. The application also provides a fraud call detection device which can be used for implementing the fraud call detection method. As shown in fig. 2, the apparatus includes: the system comprises a ticket preprocessing unit, a voice recognition unit, a text clustering unit, a text community discovery unit and a ticket community discovery unit.
The call ticket preprocessing unit is used for collecting all call voices and performing data preprocessing operation; specifically, the unit mainly collects the ticket data of the current day from a plurality of data sources, integrates the ticket data and provides the ticket data for the system to perform fraud detection. And the voice recognition unit is used for converting the conversation voice preprocessed by the ticket preprocessing unit into a text to form a text set.
The text clustering unit is used for converting each text in the text set into a keyword weight vector; forming a plurality of clusters for all the keyword weight vectors through text clustering, and determining whether each cluster is a fraud cluster according to the fraud keyword set; and determining the calls corresponding to all the keyword weight vectors in the fraud cluster as fraud calls.
The text community discovery unit is used for constructing a text social network by utilizing all calls and keywords, marking the nodes corresponding to the fraud calls as fraud calls in the text social network, and determining other nodes marked as fraud calls through label propagation; all calls corresponding to nodes marked as fraudulent calls are determined to be fraudulent calls.
The ticket community discovery unit is used for constructing a ticket social network by using all calling numbers of all calls in the fraud cluster and all numbers which have call relations with the calling numbers, and performing community discovery; determining that the corresponding community is a fraud community or a non-fraud community according to the number of fraud numbers included in each community in the ticket social network; and converting all call voices in the fraud community into texts, then performing text clustering, extracting new keywords, adding the keywords into the current fraud keyword set, and using the keywords in the process of converting all call voice texts into keyword weight vectors next time.
In the device shown in fig. 2, the call ticket community discovery unit may be omitted in consideration of cost, processing complexity, and the like; a device that includes the call ticket community discovery unit can achieve a better fraud call recall rate.
The present application can be provided to operators such as China Mobile, China Unicom, and China Telecom for rapid fraud identification. Compared with identifying fraud from a single call alone, social network discovery offers better accuracy and recall, uncovers fraud groups well, and is effective in cracking down on them. The specific scenario is as follows: successful fraud cases are generally completed through multiple calls that progress layer by layer. The aim is to detect fraud quickly in the first two calls, when the fraud has just begun, and then to intercept the subsequent fraud chain in time or to promptly warn the targeted user, thereby protecting people's property.
As can be seen from the above scenario, the present application aims to identify fraud calls quickly and effectively. The main existing means of identification is for users to mark and report fraud numbers through their smartphones. However, this approach is passive and cannot cope effectively with number changing. The social network fraud detection method based on behavior and content can therefore identify fraud quickly and effectively, and once fraud is identified, it is intercepted immediately on the network side.
As described above, the fraud call detection method and apparatus of the present application combine text clustering with complex network analysis, so they can be applied to various fraud types; they do not need to acquire sensitive user data, which makes them easier to operate. The problems mentioned in the background art are thereby solved. Specifically, compared with method one in the background art, the present application does not compare call voice against a fraud voice library, but constructs a complex network directly from the content and semantics of the voice files and analyzes that network, which saves the cost of building a voice library and makes implementation easier. Compared with method two in the background art, the present application mines call voice data by combining conventional clustering with complex network analysis, so it can mine suspicious numbers more accurately and comprehensively, in particular numbers whose behavior characteristics are close to those of normal calls; by building a network from the call voice and performing complex network analysis, it can further mine fraud calls from aspects such as calling and called relationships and the similarity of call content. Compared with method three in the background art, the present application uses unsupervised algorithms such as clustering and complex network analysis, and thus avoids dependence on labels. Compared with method four in the background art, the present application constructs the complex network only from call ticket data and the voice content of calls, and does not need to acquire personal information, which reduces the difficulty of building the network. Compared with method five in the background art, the present application needs only call ticket data and call voice to construct the network, makes full use of the characteristics of call data, and is more targeted.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A method for detecting fraudulent calls, comprising:
converting all call voices into texts to form a text set; converting each text in the text set into a keyword weight vector;
forming a plurality of clusters for all the keyword weight vectors through text clustering, and determining whether each cluster is a fraud cluster according to the fraud keyword set; determining the calls corresponding to all the keyword weight vectors in the fraud cluster as fraud calls; wherein said determining whether each cluster is a fraudulent cluster comprises: selecting words included in the fraud keyword set from all keyword vectors as fraud keywords; for each cluster, comparing the characteristics of the cluster with the preset characteristics of the fraud cluster according to the fraud keyword set, calculating the weight sum x of all fraud keywords in the cluster, calculating the ratio of the weight sum x to all the keywords in the cluster, and if the ratio is greater than a preset threshold value, determining that the cluster is a fraud cluster;
constructing a text social network by using all calls and the keywords, marking nodes corresponding to the fraud calls as fraud calls in the text social network, and determining other nodes marked as fraud calls through label propagation; determining all calls corresponding to the nodes marked as fraud calls to be fraud calls; wherein the constructing a text social network comprises: modeling each keyword in all the keyword weight vectors as a node in the text social network, modeling the call text corresponding to each keyword weight vector as a node in the text social network, and, if any call text contains any keyword, adding an edge between the node of that call text and the node of that keyword and setting the weight of the edge.
2. The method of claim 1, further comprising:
constructing a call ticket social network by using all calling numbers of all calls in the fraud cluster and all numbers which have call relations with the calling numbers, and performing community discovery; determining that the corresponding community is a fraud community or a non-fraud community according to the number of fraud numbers included in each community in the ticket social network; converting all call voices in the fraud community into texts, then performing text clustering, extracting new keywords, adding the keywords into the fraud keyword set, and using the keywords in the process of converting all call voice texts into keyword weight vectors next time;
wherein the fraud number is a calling number in the fraud call.
3. The method according to claim 1 or 2, wherein the converting each text in the text set into the keyword weight vector is performed in a TF-IDF manner;
the constructing of the text social network by using all calls and the keywords comprises: and taking the text converted from all call voices and the keywords in all the keyword weight vectors as nodes of the text social network, if the text comprises a keyword, adding an edge between the corresponding text node and the keyword node, wherein the weight of the edge is the TF-IDF value of the corresponding keyword in the corresponding text.
4. The method of claim 2, wherein constructing the ticket social network comprises:
and taking the calling numbers of all calls in the fraud cluster and all numbers which have a call relation with the calling numbers as nodes of the call bill social network, if one call exists between any two nodes, adding an edge between the corresponding nodes, and setting the weight of the corresponding edge according to the characteristics of the call corresponding to each edge.
5. The method according to claim 4, wherein the setting the weight of the corresponding edge according to the characteristics of the call corresponding to each edge comprises: determining the weight of the corresponding edge according to the comprehensive call duration of the call and the home locations of the calling number and the called number; the longer the comprehensive call duration, the greater the weight of the edge; and the more similar the home locations of the calling and called numbers, the greater the weight of the edge.
6. An apparatus for detecting fraudulent calls, comprising: a call ticket preprocessing unit, a voice recognition unit, a text clustering unit, and a text community discovery unit;
the call ticket preprocessing unit is used for collecting all call voices and performing data preprocessing operation;
the voice recognition unit is used for converting the conversation voice preprocessed by the ticket preprocessing unit into a text to form a text set;
the text clustering unit is used for converting each text in the text set into a keyword weight vector; forming a plurality of clusters for all the keyword weight vectors through text clustering, and determining whether each cluster is a fraud cluster according to the fraud keyword set; determining the calls corresponding to all the keyword weight vectors in the fraud cluster as fraud calls; wherein said determining whether each cluster is a fraudulent cluster comprises: selecting words included in the fraud keyword set from all keyword vectors as fraud keywords; for each cluster, comparing the characteristics of the cluster with the preset characteristics of the fraud cluster according to the fraud keyword set, calculating the weight sum x of all fraud keywords in the cluster, calculating the ratio of the weight sum x to all the keywords in the cluster, and if the ratio is greater than a preset threshold value, determining that the cluster is a fraud cluster;
the text community discovery unit is used for constructing a text social network by utilizing all calls and the keywords, marking nodes corresponding to the fraud calls as fraud calls in the text social network, and determining other nodes marked as fraud calls through label propagation; determining all calls corresponding to the nodes marked as fraud calls to be fraud calls; wherein the constructing a text social network comprises: modeling each keyword in all the keyword weight vectors as a node in the text social network, modeling the call text corresponding to each keyword weight vector as a node in the text social network, and, if any call text contains any keyword, adding an edge between the node of that call text and the node of that keyword and setting the weight of the edge.
7. The detection apparatus according to claim 6, wherein the apparatus further comprises a ticket community discovery unit, configured to construct a ticket social network by using the calling numbers of all calls in the fraud cluster and all numbers having a call relationship with the calling numbers, and perform community discovery; determining that the corresponding community is a fraud community or a non-fraud community according to the number of fraud numbers included in each community in the ticket social network; and converting all call voices in the fraud community into texts, then performing text clustering, extracting new keywords, adding the keywords into the fraud keyword set, and using the keywords in the process of converting all call voice texts into keyword weight vectors next time.
CN201811219800.7A 2018-10-19 2018-10-19 Detection method and device for fraud telephone Active CN109451182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811219800.7A CN109451182B (en) 2018-10-19 2018-10-19 Detection method and device for fraud telephone

Publications (2)

Publication Number Publication Date
CN109451182A CN109451182A (en) 2019-03-08
CN109451182B CN109451182B (en) 2021-08-13

Family

ID=65546669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811219800.7A Active CN109451182B (en) 2018-10-19 2018-10-19 Detection method and device for fraud telephone

Country Status (1)

Country Link
CN (1) CN109451182B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903772A (en) * 2019-03-13 2019-06-18 娄奥林 A kind of defence method of confrontation artificial intelligent voice intonation study true man's identification
CN110312047A (en) * 2019-06-24 2019-10-08 深圳市趣创科技有限公司 The method and device of automatic shield harassing call
CN110248322B (en) * 2019-06-28 2021-10-22 国家计算机网络与信息安全管理中心 Fraud group partner identification system and identification method based on fraud short messages
CN112399013B (en) * 2019-08-15 2021-12-03 中国电信股份有限公司 Abnormal telephone traffic identification method and device
CN110942783B (en) * 2019-10-15 2022-06-17 国家计算机网络与信息安全管理中心 Group call type crank call classification method based on audio multistage clustering
CN111031546B (en) * 2019-11-29 2023-09-19 武汉烽火众智数字技术有限责任公司 LR model training method applied to telephone number analysis and application method
CN111131627B (en) * 2019-12-20 2021-12-07 珠海高凌信息科技股份有限公司 Method, device and readable medium for detecting personal harmful call based on streaming data atlas
CN111884821B (en) * 2020-03-27 2022-04-29 马洪涛 Ticket data processing and displaying method and device and electronic equipment
CN111641756B (en) * 2020-05-13 2021-11-16 广州国音智能科技有限公司 Fraud identification method, device and computer readable storage medium
CN111669757B (en) * 2020-06-15 2023-03-14 国家计算机网络与信息安全管理中心 Terminal fraud call identification method based on conversation text word vector
CN112153220B (en) * 2020-08-26 2021-08-27 北京邮电大学 Communication behavior identification method based on social evaluation dynamic update
CN113378977B (en) * 2021-06-30 2023-11-21 中国农业银行股份有限公司 Recording data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107205244A (en) * 2016-03-18 2017-09-26 哈尔滨工业大学(威海) A kind of design method of the sensor network anomaly data detection based on temporal correlation
CN107590172A (en) * 2017-07-17 2018-01-16 北京捷通华声科技股份有限公司 A kind of the core content method for digging and equipment of extensive speech data
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium
CN107729919A (en) * 2017-09-15 2018-02-23 国网山东省电力公司电力科学研究院 In-depth based on big data technology is complained and penetrates analysis method
CN108121701A (en) * 2017-12-26 2018-06-05 深圳市海派通讯科技有限公司 A kind of anti-harassment automatic identifying method and its intelligent terminal
CN108280089A (en) * 2017-01-06 2018-07-13 阿里巴巴集团控股有限公司 Identify the method and apparatus sent a telegram here extremely

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9026551B2 (en) * 2013-06-25 2015-05-05 Hartford Fire Insurance Company System and method for evaluating text to support multiple insurance applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant