CN116362760A - Identification method of telecommunication fraud user and related product - Google Patents

Publication number
CN116362760A
Authority
CN
China
Prior art keywords
user
data
sample data
telecommunication fraud
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111583146.XA
Other languages
Chinese (zh)
Inventor
文可嘉
刘勇攀
刘毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202111583146.XA
Publication of CN116362760A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/01 Customer relationship services
    • G06Q30/012 Providing warranty services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry

Landscapes

  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application belongs to the technical field of telecommunication networks, and particularly relates to a method for identifying telecommunication fraud users and related products. The method comprises the following steps: obtaining a sample data set comprising positive sample data belonging to telecommunication fraud users and negative sample data whose user type is undetermined; randomly sampling the sample data set to obtain a plurality of sub-data sets; extracting features from the sample data in the sub-data sets to obtain user behavior features with multiple dimensions; establishing, according to the user behavior features of the multiple dimensions, a random forest model for judging whether sample data belongs to a telecommunication fraud user; and inputting the user data to be identified into the random forest model to judge, through the random forest model, whether the user data to be identified belongs to a telecommunication fraud user. The method can improve the efficiency of prejudging fraud calls, reduce the probability of misjudging normal users, and improve the effectiveness of the prejudgment.

Description

Identification method of telecommunication fraud user and related product
Technical Field
The present application belongs to the technical field of telecommunication networks, and in particular relates to a method for identifying a telecommunication fraud user, an identification device of the telecommunication fraud user, a computer readable medium, an electronic device and a computer program product.
Background
With the wide application and development in recent years of intelligent terminals and network technologies such as smartphones and the Internet, telecommunication fraud in China has grown from small-scale regional activity into a nationwide and increasingly concealed problem, and telecommunication fraud crimes of various types are becoming more serious. Currently, the telecommunications industry lacks an efficient method for identifying fraud calls.
Disclosure of Invention
The present application aims to provide a method for identifying a telecommunication fraud user, an identification device for the telecommunication fraud user, a computer readable medium, an electronic device and a computer program product, which overcome, at least to some extent, the technical problem that telecommunication fraud users are difficult to identify.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned in part by the practice of the application.
According to an aspect of the embodiments of the present application, there is provided a method for identifying a telecommunication fraud user, the method comprising:
obtaining a sample data set comprising positive sample data belonging to telecommunication fraud users and negative sample data of undetermined user types;
randomly sampling the sample data set to obtain a plurality of sub-data sets;
extracting features of sample data in the sub-data set to obtain user behavior features with multiple dimensions;
establishing a random forest model for judging whether the sample data is a telecommunication fraud user according to the user behavior characteristics of the multiple dimensions;
and inputting the user data to be identified into the random forest model to judge whether the user data to be identified is a telecommunication fraud user or not through the random forest model.
In some embodiments of the present application, based on the above technical solutions, obtaining a sample dataset includes:
classifying the telecommunication network users to obtain user data of which the user types are determined to be telecommunication fraud users or non-telecommunication fraud users and user data of which the user types are not determined;
randomly sampling user data of which the user type is determined to be a telecommunication fraud user to obtain positive sample data of a first preset number;
randomly sampling user data of an undetermined user type to obtain negative sample data of a second preset number, wherein the second preset number is a designated multiple of the first preset number;
the positive sample data and the negative sample data are combined into a sample data set.
In some embodiments of the present application, based on the above technical solution, randomly sampling the sample data set to obtain a plurality of sub data sets includes:
and randomly sampling the sample data set according to a preset sample proportion to obtain a plurality of sub-data sets with the same sample number, wherein the sample proportion is the sample number proportion of the positive sample data to the negative sample data.
In some embodiments of the present application, based on the above technical solution, the sample data includes identity information of a user, and the user behavior feature includes a linear feature distributed according to a time sequence and a nonlinear feature distributed according to a spatial sequence.
In some embodiments of the present application, based on the above technical solution, performing feature extraction on sample data in the sub-data set to obtain user behavior features with multiple dimensions includes:
acquiring data according to the sample data in the sub-data set to obtain time dimension data and space dimension data of a user, wherein the time dimension data comprises at least one of ticket data, flow data and long-time attribute data, and the space dimension data comprises geographic information data of the user;
extracting features of the time dimension data according to preset time granularity to obtain linear features distributed according to a time sequence;
and extracting the characteristics of the space dimension data according to a preset space grid to obtain nonlinear characteristics distributed according to a space sequence.
In some embodiments of the present application, based on the above technical solution, establishing a random forest model for determining whether the sample data is a telecommunication fraud user according to the user behavior characteristics of the multiple dimensions includes:
establishing a decision tree model corresponding to the sub-data set according to the user behavior characteristics of the multiple dimensions;
and forming a plurality of decision tree models into a random forest model for judging whether the sample data is a telecommunication fraud user.
In some embodiments of the present application, based on the above technical solution, establishing a decision tree model corresponding to the sub-dataset according to the user behavior characteristics of the multiple dimensions includes:
respectively acquiring the distribution probability of each item characteristic value in the user behavior characteristics of each dimension;
determining a redundancy index of the user behavior characteristic according to the distribution probability of each characteristic value, wherein the redundancy index is used for representing the distribution redundancy degree of the characteristic value;
sequencing the user behavior characteristics of multiple dimensions according to the redundancy index to obtain a characteristic sequence;
and establishing a decision tree model corresponding to the sub-data set by taking the characteristic sequence as a classification node.
According to an aspect of embodiments of the present application, there is provided an identification device of a telecommunication fraud user, the device comprising:
an acquisition module configured to acquire a sample data set comprising positive sample data belonging to a telecommunication fraud user and negative sample data of an undetermined user type;
the sampling module is configured to randomly sample the sample data set to obtain a plurality of sub-data sets;
the extraction module is configured to perform feature extraction on the sample data in the sub-data set to obtain user behavior features with multiple dimensions;
a building module configured to build a random forest model for judging whether the sample data is a telecommunication fraud user according to the user behavior characteristics of the plurality of dimensions;
an identification module configured to input user data to be identified into the random forest model to identify, by the random forest model, whether the user data to be identified is a telecommunication fraud user.
According to an aspect of embodiments of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method of identifying telecommunication fraud users as in the above technical solutions.
According to an aspect of the embodiments of the present application, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of identifying telecommunication fraud users as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device performs the telecommunication fraud user identification method as in the above technical solution.
In the technical solution provided by the embodiments of the present application, a random forest model is established from collected user behavior features, so that telecommunication fraud users can be identified according to user behavior features of multiple dimensions. This greatly improves the efficiency of prejudging fraud calls, reduces the probability of misjudging normal users, improves the effectiveness of the prejudgment, and saves the labor cost of manual confirmation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
Fig. 2 shows a flow chart of steps of a related art fraud user identification method.
FIG. 3 shows a flow chart of steps of a method of identifying a telecommunication fraud user in one embodiment of the present application.
FIG. 4 illustrates a functional block diagram of user behavior feature acquisition in one embodiment of the present application.
FIG. 5 shows a flowchart of the steps for constructing a decision tree model in one embodiment of the present application.
FIG. 6 shows a functional block diagram of telecommunications fraud user identification in one embodiment of the present application.
Fig. 7 shows a block diagram of a telecommunication fraud user identification apparatus according to an embodiment of the present application.
Fig. 8 shows a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include various electronic devices such as smart phones, tablet computers, notebook computers, desktop computers, and the like. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular in this application.
Fig. 2 shows a flow chart of the steps of a fraud user identification method in the related art. As shown in fig. 2, the related-art method of identifying fraud users may mainly include the following steps S210 to S240.
Step S210: Construct a fraud user information base.
For complaints received through the complaint hotlines 12321 and 10000, the verified identity certificate information of each fraudulent user and the IMEI serial number of the mobile phone are recorded to form a fraud user information base.
Step S220: Construct fraud user business rules.
Fraud user business rules are constructed according to the characteristics of the user group, for example: 1) matching certificate numbers against the fraud certificate number library; 2) selecting a high-risk roaming user group, that is, identifying the regions nationwide where fraud calls occur most often and flagging number cards that roam to such a region and make frequent outgoing calls there; 3) defining a user call behavior rule, for example that a user makes more than 20 outgoing calls within 30 minutes.
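As an illustration of why such a fixed rule is rigid, the call behavior rule in item 3) can be sketched as a sliding-window check. This is a minimal sketch, not the related art's actual implementation: the function name `exceeds_call_rule`, the representation of call times as minutes, and the reading of the rule as "more than 20 outgoing calls within any 30-minute window" are assumptions made for the example.

```python
from collections import deque


def exceeds_call_rule(call_times_min, max_calls=20, window_min=30):
    """Return True if more than max_calls calls fall inside any window of
    window_min minutes. call_times_min holds outgoing call start times,
    expressed in minutes (hypothetical input format)."""
    window = deque()
    for t in sorted(call_times_min):
        window.append(t)
        # Drop calls that started window_min or more minutes before this one.
        while t - window[0] >= window_min:
            window.popleft()
        if len(window) > max_calls:
            return True
    return False
```

Both thresholds are hard-coded defaults here, which mirrors the weakness the text goes on to describe: a user just under the threshold is never flagged, and a legitimate heavy caller is.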
Step S230: Judge fraud users according to the business rules.
Users who newly join the network are judged and screened according to the business rules formulated in step S220 to obtain a final list of suspected fraud users.
Step S240: Confirm and handle fraud users.
A manual return call confirms whether the number is used for fraud, and corresponding measures are taken once fraud is confirmed.
When the above related art is used for determination, the following problems arise:
(1) Certificate numbers are used to match fraud phones. Because the current real-name registration system for number cards requires liveness verification, few number cards are registered by someone other than the actual user, so the matchable user group is small and overall matching coverage suffers;
(2) A user group that roams to high-fraud areas is selected. Fraud is currently shifting from specific areas toward localization and its methods are becoming more concealed, so this rule only catches the subset of fraud users whose number cards roam to other regions, and overall coverage is poor;
(3) Fixed user behavior rules struggle to adapt to constantly changing conditions, and fixed thresholds on criteria such as calls, short messages, and traffic easily flag users whose usage is normal.
Currently, the telecommunications industry lacks an efficient method for identifying fraud calls. To further reduce telecommunication fraud, a fraud identification method based on user behavior is urgently needed.
The following describes in detail, with reference to specific embodiments, a method for identifying a telecommunication fraud user, an apparatus for identifying a telecommunication fraud user, a computer readable medium, an electronic device, and a computer program product.
FIG. 3 shows a flow chart of steps of a method of identifying a telecommunication fraud user in one embodiment of the present application. As shown in fig. 3, the identification method of telecommunication fraud users mainly includes the following steps S310 to S350.
Step S310: acquiring a sample data set comprising positive sample data belonging to telecommunication fraud users and negative sample data of undetermined user types;
Step S320: Randomly sampling the sample data set to obtain a plurality of sub-data sets;
Step S330: Extracting features of sample data in the sub-data set to obtain user behavior features with multiple dimensions;
Step S340: Establishing a random forest model for judging whether the sample data is a telecommunication fraud user according to the user behavior features of multiple dimensions;
Step S350: Inputting the user data to be identified into the random forest model to judge, through the random forest model, whether the user data to be identified belongs to a telecommunication fraud user.
In the identification method of the telecommunication fraud user provided by the embodiments of the present application, a random forest model is established from collected user behavior features, so that telecommunication fraud users can be identified according to user behavior features of multiple dimensions. This greatly improves the efficiency of prejudging fraud calls, reduces the probability of misjudging normal users, improves the effectiveness of the prejudgment, and saves the labor cost of manual confirmation.
The following describes in detail each method step of the identification method of the telecommunication fraud user in the embodiment of the present application.
In step S310, a sample data set is acquired, the sample data set comprising positive sample data belonging to a telecommunication fraud user and negative sample data of an undetermined user type.
In one embodiment of the present application, a method of acquiring a sample dataset may comprise the steps of: classifying the telecommunication network users to obtain user data of which the user types are determined to be telecommunication fraud users or non-telecommunication fraud users and user data of which the user types are not determined; randomly sampling user data of which the user type is determined to be a telecommunication fraud user to obtain positive sample data of a first preset number; randomly sampling the user data of the undetermined user type to obtain negative sample data of a second preset number, wherein the second preset number is a designated multiple of the first preset number; the positive sample data and the negative sample data are combined into a sample data set.
For example, when classifying telecommunication network users, users verified through complaints to 12321 and 10000 as having committed telecommunication fraud may be marked as telecommunication fraud users; users known to have no telecommunication fraud behavior, such as staff of the telecommunication network operator, home users, and partner enterprise users, may be marked as non-telecommunication fraud users; and all remaining telecommunication network users may be marked as users of an undetermined user type. Excluding the non-telecommunication fraud users when the sample data set is constructed in turn increases the concentration of the user data used to build the model.
In one embodiment of the present application, data for a telecommunication fraud users (a being the first preset number) may be collected as positive sample data, constituting a positive sample data set A. Data are then collected at m times the number of positive samples: m × a users of an undetermined user type are collected as negative sample data, constituting a negative sample data set B. Finally, the two data sets are combined into the sample data set S = {A, B}.
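The construction of S = {A, B} can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function name `build_sample_set`, the use of plain user identifiers, the labels 1 (positive) and 0 (negative), and sampling without replacement are all choices made for the example and are not specified by the source.

```python
import random


def build_sample_set(fraud_users, undetermined_users, a, m, seed=0):
    """Draw a positive samples and m * a negative samples and merge them.

    fraud_users: users already verified as telecommunication fraud users.
    undetermined_users: users whose type is undetermined; verified
    non-fraud users are assumed to have been excluded beforehand.
    Returns a list of (user, label) pairs, label 1 = positive, 0 = negative.
    """
    rng = random.Random(seed)
    positives = rng.sample(fraud_users, a)             # data set A
    negatives = rng.sample(undetermined_users, m * a)  # data set B
    return [(u, 1) for u in positives] + [(u, 0) for u in negatives]
```

For example, with a = 10 and m = 3 the resulting set holds 10 positive and 30 negative samples.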
In step S320, the sample data set is randomly sampled, resulting in a plurality of sub-data sets.
In one embodiment of the present application, a method for randomly sampling a sample data set to obtain a plurality of sub-data sets may include the steps of: randomly sampling the sample data set according to a preset sample proportion to obtain a plurality of sub-data sets with the same sample number, wherein the sample proportion is the sample number proportion of the positive sample data and the negative sample data.
For example, in the embodiment of the present application, random sampling may be performed at a 1:1 ratio of positive to negative samples to obtain a sub-data set containing b samples. Repeating the sampling several times with the same sampling strategy yields n sub-data sets.
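A minimal sketch of this balanced sub-sampling follows. The name `draw_sub_datasets`, the (user, label) pair format, and the choice to sample without replacement inside each sub-data set are assumptions for illustration; the source fixes only the 1:1 ratio, the size b, and the count n.

```python
import random


def draw_sub_datasets(sample_set, b, n, seed=0):
    """Return n sub-data sets of b samples each, half positive and half
    negative. sample_set is a list of (user, label) pairs with label 1 or 0."""
    if b % 2:
        raise ValueError("b must be even for a 1:1 positive/negative ratio")
    rng = random.Random(seed)
    positives = [s for s in sample_set if s[1] == 1]
    negatives = [s for s in sample_set if s[1] == 0]
    half = b // 2
    sub_datasets = []
    for _ in range(n):
        # Each draw is independent, so a sample may appear in several sub-data sets.
        sub = rng.sample(positives, half) + rng.sample(negatives, half)
        rng.shuffle(sub)
        sub_datasets.append(sub)
    return sub_datasets
```

Because the full set S holds m times more negatives than positives, drawing at 1:1 means each sub-data set sees a different slice of the undetermined users while reusing the scarcer positive samples.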
In step S330, feature extraction is performed on the sample data in the sub-data set, so as to obtain user behavior features with multiple dimensions.
In one embodiment of the present application, the sample data includes identity information of the user, for example the user's name, phone number, identity card number, and agent. The user behavior features include linear features distributed in a time series and nonlinear features distributed in a spatial series.
In one embodiment of the present application, feature extraction is performed on sample data in a sub-data set to obtain user behavior features with multiple dimensions, which may include the following steps: acquiring data according to sample data in the sub-data set to obtain time dimension data and space dimension data of a user, wherein the time dimension data comprises at least one of ticket data, flow data and long-time attribute data, and the space dimension data comprises geographic information data of the user; performing feature extraction on the time dimension data according to the preset time granularity to obtain linear features distributed according to the time sequence; and extracting features of the space dimension data according to a preset space grid to obtain nonlinear features distributed according to the space sequence.
FIG. 4 illustrates a functional block diagram of user behavior feature acquisition in one embodiment of the present application. As shown in fig. 4, first, feature extraction is performed on sample data to obtain a linear feature 401 and a nonlinear feature 402. The linear feature 401 may include user ticket data, traffic data, long-diffuse attribute data, and the like, and the nonlinear feature 402 may include GIS positioning data of the user.
When the time dimension data is collected, the linear features 401, such as the user ticket data, traffic data, and long-diffuse attribute data, are aggregated along the time dimension to obtain time sequence features 403 with the month as the time granularity. When the space dimension data is collected, a provincial, municipal, or other administrative division may be divided into a spatial grid at a granularity of 1 km × 1 km to obtain grid areas Gr01, Gr02, and so on. Overlaying the GIS positioning data recorded during the user's working hours on the spatial grid yields the grid cells in which the user is located during working hours, and from these the space sequence features 404 distributed according to the spatial sequence are obtained. The time sequence features 403 and the space sequence features 404 are then fused to obtain the user behavior features 405 with multiple dimensions.
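The aggregation and fusion described above can be sketched as follows. This is a simplified sketch under several assumptions that are not in the source: positions are already projected to kilometre coordinates, working hours are taken as 09:00 to 18:00, the spatial feature is reduced to the single most frequent working-hours grid cell, and all function names are hypothetical.

```python
from collections import Counter
from datetime import datetime
import math


def monthly_counts(timestamps):
    """Aggregate call timestamps into counts per calendar month."""
    counts = Counter(ts.strftime("%Y-%m") for ts in timestamps)
    return dict(sorted(counts.items()))


def grid_cell(x_km, y_km, cell_km=1.0):
    """Map projected coordinates (in km) to the index of a square grid cell."""
    return (math.floor(x_km / cell_km), math.floor(y_km / cell_km))


def working_hours_cell(fixes, start_hour=9, end_hour=18, cell_km=1.0):
    """Return the grid cell seen most often during working hours, or None.

    fixes is a list of (timestamp, x_km, y_km) positioning records."""
    cells = Counter(
        grid_cell(x, y, cell_km)
        for ts, x, y in fixes
        if start_hour <= ts.hour < end_hour
    )
    return cells.most_common(1)[0][0] if cells else None


def fuse_features(call_timestamps, fixes):
    """Combine the time sequence and space sequence features in one record."""
    features = {f"calls_{month}": n
                for month, n in monthly_counts(call_timestamps).items()}
    features["work_cell"] = working_hours_cell(fixes)
    return features
```

A real pipeline would aggregate traffic and other attributes the same way and would project GIS coordinates before gridding; those steps are omitted to keep the sketch short.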
In step S340, a random forest model for judging whether the sample data is a telecommunication fraud user is established according to the user behavior characteristics of the plurality of dimensions.
In one embodiment of the present application, a method of building a random forest model may include: establishing a decision tree model corresponding to the sub-data set according to the user behavior characteristics of multiple dimensions; a plurality of decision tree models are organized into a random forest model for determining whether the sample data is a telecommunication fraud user.
In one embodiment of the present application, a method of building a decision tree model corresponding to a sub-dataset according to user behavior characteristics of multiple dimensions may include: respectively acquiring the distribution probability of each item characteristic value in the user behavior characteristics of each dimension; determining a redundancy index of the user behavior characteristics according to the distribution probability of each characteristic value, wherein the redundancy index is used for representing the distribution redundancy degree of the characteristic values; sequencing the user behavior characteristics of multiple dimensions according to the redundancy index to obtain a characteristic sequence; and establishing a decision tree model corresponding to the sub-data set by taking the characteristic sequence as a branch node.
In one embodiment of the present application, the user behavior features may be used one after another as branch nodes of the decision tree model in descending order of redundancy index, finally forming a decision tree model in which child nodes branch successively from the root node until the leaf nodes are reached.
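The ordering step can be illustrated with a short sketch. The source's own formula for the redundancy index is not available in this text, so the sketch substitutes Shannon entropy with the natural logarithm, computed from the distribution probability of each feature value. That substitution is an assumption made purely for illustration and should not be read as the patent's index; the names `value_distribution`, `stand_in_index`, and `rank_features` are likewise hypothetical.

```python
import math
from collections import Counter


def value_distribution(values):
    """Distribution probability of each distinct value of one feature."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}


def stand_in_index(values):
    """ASSUMPTION: Shannon entropy (natural log) used as a stand-in score.
    The patent's actual redundancy index formula is not reproduced here."""
    return -sum(p * math.log(p) for p in value_distribution(values).values())


def rank_features(rows, feature_names):
    """Order features from highest to lowest stand-in index.

    rows is a list of dicts mapping feature name to a categorical value."""
    scores = {f: stand_in_index([r[f] for r in rows]) for f in feature_names}
    return sorted(feature_names, key=lambda f: scores[f], reverse=True)
```

Whatever index is used, the resulting sequence is what determines the order in which features become branch nodes.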
FIG. 5 shows a flowchart of the steps for constructing a decision tree model in one embodiment of the present application. As shown in fig. 5, in one embodiment of the present application, a method of constructing a decision tree model may include the following steps S510 to S550.
Step S510: Randomly sample from the sample set S at a 1:1 ratio of positive to negative samples to obtain a sub-data set consisting of b sample data.
Step S520: Repeat step S510, randomly sampling from the sample set S until n sub-data sets are obtained.
Step S530: Calculate the redundancy index of the user behavior features in the sub-data set.
The redundancy index PROM(D) is calculated by the following formula:
[Formula image BDA0003427584280000091, not reproduced in this text]
wherein e is a natural constant, d is the enumeration number of the characteristic values of a certain user behavior characteristic, and P (k) is the distribution probability of each characteristic value.
Step S540: Exclude invalid user behavior features and those already used, select the remaining user behavior feature with the largest redundancy index, partition the dataset according to the values of that feature, and construct a branch node of the decision tree.
Step S550: Repeat step S540 until all user behavior features have been used, completing the construction of the decision tree model.
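As an illustration only, the feature ordering in steps S530 to S550 can be sketched as follows. The formula for PROM(D) is given in a figure that is not reproduced in this text, so the sketch substitutes the Shannon entropy (natural logarithm) of the feature-value distribution as a clearly labeled stand-in; it is not the redundancy index of the application. The function names `value_distribution`, `stand_in_redundancy`, and `order_features`, and the representation of a sample as a dictionary of feature values, are hypothetical.

```python
import math
from collections import Counter


def value_distribution(values):
    """Return the distribution probability P(k) of each distinct feature value k."""
    counts = Counter(values)
    total = len(values)
    return {k: c / total for k, c in counts.items()}


def stand_in_redundancy(values):
    """ASSUMPTION: Shannon entropy (base e) used in place of PROM(D),
    whose actual formula is not available in this text."""
    probs = value_distribution(values).values()
    return -sum(p * math.log(p) for p in probs if p > 0)


def order_features(samples, feature_names, index=stand_in_redundancy):
    """Sort feature names from the highest to the lowest index value.

    `samples` is a list of dicts mapping feature name -> feature value.
    """
    scores = {name: index([s[name] for s in samples]) for name in feature_names}
    return sorted(feature_names, key=lambda name: scores[name], reverse=True)
```

Because the index is passed in as a parameter, the stand-in can be swapped for the actual PROM(D) computation once its formula is known; the ordering logic itself does not change.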
For each sub-data set formed by sampling in step S520, a corresponding decision tree model may be constructed, so that n decision tree models are constructed from n sub-data sets. The n decision tree models are combined to form a random forest model.
In step S350, the user data to be identified is input into the random forest model, which determines whether the user to be identified is a telecommunication fraud user.
The random forest model consists of n mutually independent decision tree models. After the user data to be identified is classified by the n decision tree models, a classification result is obtained from each decision tree, indicating whether the user to be identified is a telecommunication fraud user or a non-telecommunication fraud user. The final identification result is then generated by voting over the classification results output by the n decision trees.
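The voting step can be sketched as a simple majority vote. This is a minimal illustration under assumptions: each decision tree is represented as a Python callable that returns a label, and the names `majority_vote` and `forest_predict` are hypothetical. The application does not state how ties are resolved; in this sketch a tie goes to the label that appears first among the votes, and using an odd number of trees avoids ties in the two-class case.

```python
from collections import Counter


def majority_vote(tree_outputs):
    """Return the label output by the largest number of decision trees."""
    if not tree_outputs:
        raise ValueError("at least one tree output is required")
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(tree_outputs).most_common(1)[0][0]


def forest_predict(trees, user_features):
    """Classify one user with every tree, then vote on the final result."""
    votes = [tree(user_features) for tree in trees]
    return majority_vote(votes)
```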
FIG. 6 shows a functional block diagram of telecommunication fraud user identification in one embodiment of the present application. As shown in FIG. 6, the random forest model 601 includes n decision tree models, namely sub-model 1, sub-model 2, ..., sub-model n. Before user identification, the random forest model 601 may be screened: after irrelevant sub-models are removed, an updated random forest model 602 composed of m decision tree models is obtained, comprising effective sub-model 1, effective sub-model 2, ..., effective sub-model m. For example, embodiments of the present application may use a pre-established verification data set to verify the random forest model 601 multiple times, thereby eliminating decision trees with low recognition accuracy.
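One possible way to screen the sub-models is sketched below. It assumes that each decision tree is a callable returning a label, that the verification data set is a list of (features, label) pairs, and that a tree is kept when its accuracy reaches a threshold. The threshold value and the names `accuracy` and `screen_trees` are assumptions made for illustration; the application does not specify the screening criterion beyond low recognition accuracy.

```python
def accuracy(tree, verification_set):
    """Fraction of verification samples that the tree classifies correctly."""
    correct = sum(1 for features, label in verification_set if tree(features) == label)
    return correct / len(verification_set)


def screen_trees(trees, verification_set, min_accuracy=0.5):
    """Keep only the trees whose accuracy is at least min_accuracy.

    min_accuracy is a hypothetical threshold, not a value from the application.
    """
    return [t for t in trees if accuracy(t, verification_set) >= min_accuracy]
```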
The inventory samples to be predicted 603 are input into the updated random forest model 602, which may output a predicted list of suspected fraud phones 604.
The embodiments of the present application can address the problems that the characteristics of existing fraud calls are unclear, that such calls are difficult to identify, and that identification accuracy and stability are low. By analyzing time-series data together with spatial data and applying a suitable classification algorithm, classification accuracy can be improved. Randomly sampling the fraud behavior features to build multiple decision trees reduces the influence of any single feature on the overall prediction result. This improves the efficiency of predicting fraud calls, lowers the probability of misjudging normal users, improves the effectiveness of prediction, and saves manual confirmation costs, which in turn benefits social stability.
It should be noted that although the steps of the methods in the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes an embodiment of an apparatus of the present application, which may be used to perform the method for identifying telecommunication fraud users in the above-described embodiments of the present application. Fig. 7 schematically shows a block diagram of a telecommunication fraud user identification apparatus provided in an embodiment of the present application. As shown in fig. 7, the identification device 700 of a telecommunication fraud user may mainly include:
an acquisition module 710 configured to acquire a sample data set comprising positive sample data belonging to a telecommunication fraud user and negative sample data of an undetermined user type;
a sampling module 720 configured to randomly sample the sample data set to obtain a plurality of sub-data sets;
an extraction module 730 configured to perform feature extraction on the sample data in the sub-data set to obtain user behavior features with multiple dimensions;
a building module 740 configured to build a random forest model for judging whether the sample data is a telecommunication fraud user according to the user behavior characteristics of the plurality of dimensions;
an identification module 750 configured to input user data to be identified into the random forest model to identify, by the random forest model, whether the user data to be identified is a telecommunication fraud user.
In one embodiment of the present application, based on the above embodiment, the acquisition module 710 may further include:
a user classification module 711 configured to classify the telecommunication network users to obtain user data of which the user type is determined to be a telecommunication fraud user or a non-telecommunication fraud user and user data of which the user type is not determined;
a positive sample sampling module 712 configured to randomly sample in the user data for which the user type is determined to be a telecommunication fraud user, obtaining a first preset number of positive sample data;
a negative sample sampling module 713 configured to randomly sample user data of an undetermined user type to obtain a second preset number of negative sample data, the second preset number being a specified multiple of the first preset number;
a sample combination module 714 configured to combine the positive sample data and the negative sample data into a sample data set.
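The cooperation of the positive sample sampling module 712, the negative sample sampling module 713, and the sample combination module 714 can be sketched as follows. This is a minimal illustration under assumptions: users are represented by plain identifiers, label 1 marks a positive sample and label 0 a negative sample, and the name `build_sample_set` and its parameters are hypothetical.

```python
import random


def build_sample_set(confirmed_fraud, undetermined, n_positive, multiple, seed=None):
    """Combine randomly drawn positive and negative samples into one sample set.

    n_positive is the first preset number; the second preset number is
    n_positive * multiple, i.e. a specified multiple of the first.
    """
    rng = random.Random(seed)
    # Positive samples: users already determined to be telecommunication fraud users.
    positives = rng.sample(confirmed_fraud, n_positive)
    # Negative samples: users whose type has not been determined.
    negatives = rng.sample(undetermined, n_positive * multiple)
    return [(user, 1) for user in positives] + [(user, 0) for user in negatives]
```

`random.sample` draws without replacement and raises `ValueError` when more samples are requested than the pool contains, so both pools must be at least as large as the requested counts.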
In one embodiment of the present application, based on the above embodiments, the sampling module 720 may be further configured to: randomly sample the sample data set according to a preset sample proportion to obtain a plurality of sub-data sets with the same number of samples, wherein the sample proportion is the ratio of the number of positive samples to the number of negative samples.
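The proportional sampling performed by the sampling module 720 can be sketched as follows. The sketch assumes that the sample data set is a list of (user, label) pairs with label 1 for positive and 0 for negative samples, that the proportion is expressed as the fraction of positive samples in each sub-data set (0.5 corresponds to a 1:1 ratio), and that samples are drawn without replacement within a sub-data set. The application does not state whether sampling is with or without replacement, and the name `draw_sub_datasets` is hypothetical.

```python
import random


def draw_sub_datasets(sample_set, n, b, positive_fraction=0.5, seed=None):
    """Draw n sub-data sets of b samples each at a fixed positive/negative proportion."""
    rng = random.Random(seed)
    positives = [s for s in sample_set if s[1] == 1]
    negatives = [s for s in sample_set if s[1] == 0]
    n_pos = round(b * positive_fraction)
    n_neg = b - n_pos
    sub_datasets = []
    for _ in range(n):
        # Each sub-data set is drawn independently, so a sample may
        # appear in several sub-data sets.
        subset = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
        rng.shuffle(subset)
        sub_datasets.append(subset)
    return sub_datasets
```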
In one embodiment of the present application, based on the above embodiments, the sample data includes identity information of the user, and the user behavior features include linear features distributed in time series and nonlinear features distributed in spatial series.
In one embodiment of the present application, based on the above embodiments, the extraction module 730 may further include:
a data acquisition module 731 configured to acquire data according to the sample data in the sub-data set to obtain time dimension data and space dimension data of the user, wherein the time dimension data includes at least one of ticket data, flow data, and long-distance and roaming attribute data, and the space dimension data includes geographic information data of the user;
a time feature extraction module 732 configured to perform feature extraction on the time dimension data according to a preset time granularity to obtain linear features distributed according to a time sequence;
a spatial feature extraction module 733 configured to perform feature extraction on the space dimension data according to a preset spatial grid to obtain nonlinear features distributed according to a spatial sequence.
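To make the notions of time granularity and spatial grid concrete, the following sketch aggregates call timestamps into fixed-length time buckets and maps coordinates onto a regular grid. The specific features (call count per bucket, number of distinct grid cells visited), the default bucket length and cell size, and the names `calls_per_time_bucket`, `grid_cell`, and `distinct_cells_visited` are assumptions chosen for illustration; the application does not enumerate the features it extracts.

```python
import math
from collections import Counter


def calls_per_time_bucket(call_timestamps, bucket_seconds=3600):
    """Count calls in each time bucket; timestamps are seconds since an epoch."""
    return Counter(int(ts // bucket_seconds) for ts in call_timestamps)


def grid_cell(lon, lat, cell_degrees=0.01):
    """Map a longitude/latitude pair to the index of its grid cell."""
    return (math.floor(lon / cell_degrees), math.floor(lat / cell_degrees))


def distinct_cells_visited(locations, cell_degrees=0.01):
    """Number of distinct grid cells among a user's (lon, lat) locations."""
    return len({grid_cell(lon, lat, cell_degrees) for lon, lat in locations})
```

A coarser time granularity or a larger grid cell smooths the features, while a finer one preserves detail at the cost of sparser counts.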
In one embodiment of the present application, based on the above embodiments, the building module 740 may further include:
a decision tree building module 741 configured to build a decision tree model corresponding to the sub-dataset according to the user behavior characteristics of the plurality of dimensions;
a random forest building module 742 configured to combine a plurality of said decision tree models into a random forest model for determining whether said sample data is a telecommunication fraud user.
In one embodiment of the present application, based on the above embodiments, the decision tree building module 741 may be further configured to: acquire, for the user behavior feature of each dimension, the distribution probability of each of its feature values; determine a redundancy index of the user behavior feature according to the distribution probability of each feature value, where the redundancy index represents the degree of redundancy in the distribution of the feature values; sort the user behavior features of the multiple dimensions by redundancy index to obtain a feature sequence; and build the decision tree model corresponding to the sub-data set with the features in the feature sequence as classification nodes.
Specific details of the identification device for telecommunication fraud users provided in the embodiments of the present application have been described in detail in the corresponding method embodiments, and are not described herein again.
Fig. 8 schematically shows a block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
It should be noted that, the computer system 800 of the electronic device shown in fig. 8 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a central processing unit 801 (Central Processing Unit, CPU) which can execute various appropriate actions and processes according to a program stored in a Read-Only Memory 802 (ROM) or a program loaded from a storage section 808 into a random access Memory 803 (Random Access Memory, RAM). In the random access memory 803, various programs and data required for system operation are also stored. The central processing unit 801, the read only memory 802, and the random access memory 803 are connected to each other through a bus 804. An Input/Output interface 805 (i.e., an I/O interface) is also connected to the bus 804.
The following components are connected to the input/output interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a local area network card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the input/output interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811. The computer program, when executed by the central processing unit 801, performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer readable signal medium may include a data signal that propagates in baseband or as part of a carrier wave, with the computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for identifying telecommunication fraud users, comprising:
obtaining a sample data set comprising positive sample data belonging to telecommunication fraud users and negative sample data of undetermined user types;
randomly sampling the sample data set to obtain a plurality of sub-data sets;
extracting features of sample data in the sub-data set to obtain user behavior features with multiple dimensions;
establishing a random forest model for judging whether the sample data is a telecommunication fraud user according to the user behavior characteristics of the multiple dimensions;
and inputting the user data to be identified into the random forest model to judge whether the user data to be identified is a telecommunication fraud user or not through the random forest model.
2. The telecommunications fraud user identification method of claim 1, wherein obtaining a sample dataset comprises:
classifying the telecommunication network users to obtain user data of which the user types are determined to be telecommunication fraud users or non-telecommunication fraud users and user data of which the user types are not determined;
randomly sampling user data of which the user type is determined to be a telecommunication fraud user to obtain positive sample data of a first preset number;
randomly sampling user data of an undetermined user type to obtain negative sample data of a second preset number, wherein the second preset number is a designated multiple of the first preset number;
and combining the positive sample data and the negative sample data into a sample data set.
3. The method for identifying a telecommunication fraud user according to claim 1, characterized in that randomly sampling said sample data set, obtaining a plurality of sub-data sets, comprises:
and randomly sampling the sample data set according to a preset sample proportion to obtain a plurality of sub-data sets with the same sample number, wherein the sample proportion is the sample number proportion of the positive sample data to the negative sample data.
4. The telecommunication fraud user identification method according to claim 1, wherein the sample data includes identity information of a user, and the user behavior features include linear features distributed in time series and nonlinear features distributed in spatial series.
5. The method for identifying a telecommunication fraud user according to claim 4, wherein extracting features from the sample data in the sub-data set to obtain user behavior features with multiple dimensions comprises:
acquiring data according to the sample data in the sub-data set to obtain time dimension data and space dimension data of a user, wherein the time dimension data comprises at least one of ticket data, flow data, and long-distance and roaming attribute data, and the space dimension data comprises geographic information data of the user;
extracting features of the time dimension data according to preset time granularity to obtain linear features distributed according to a time sequence;
and extracting the characteristics of the space dimension data according to a preset space grid to obtain nonlinear characteristics distributed according to a space sequence.
6. The telecommunication fraud user identification method of claim 1, wherein establishing a random forest model for judging whether the sample data is a telecommunication fraud user according to the plurality of dimensional user behavior features comprises:
establishing a decision tree model corresponding to the sub-data set according to the user behavior characteristics of the multiple dimensions;
and forming a plurality of decision tree models into a random forest model for judging whether the sample data is a telecommunication fraud user.
7. The telecommunications fraud user identification method of claim 6, wherein establishing a decision tree model corresponding to the sub-data set according to the plurality of dimensions of user behavior features comprises:
respectively acquiring the distribution probability of each item characteristic value in the user behavior characteristics of each dimension;
determining a redundancy index of the user behavior characteristic according to the distribution probability of each characteristic value, wherein the redundancy index is used for representing the distribution redundancy degree of the characteristic value;
sequencing the user behavior characteristics of multiple dimensions according to the redundancy index to obtain a characteristic sequence;
and establishing a decision tree model corresponding to the sub-data set by taking the characteristic sequence as a classification node.
8. An identification device for telecommunication fraud users, characterized in that it comprises:
an acquisition module configured to acquire a sample data set comprising positive sample data belonging to a telecommunication fraud user and negative sample data of an undetermined user type;
the sampling module is configured to randomly sample the sample data set to obtain a plurality of sub-data sets;
the extraction module is configured to perform feature extraction on the sample data in the sub-data set to obtain user behavior features with multiple dimensions;
a building module configured to build a random forest model for judging whether the sample data is a telecommunication fraud user according to the user behavior characteristics of the plurality of dimensions;
an identification module configured to input user data to be identified into the random forest model to identify, by the random forest model, whether the user data to be identified is a telecommunication fraud user.
9. A computer readable medium, characterized in that it has stored thereon a computer program, which when executed by a processor implements the telecommunication fraud user identification method of any of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to cause the electronic device to perform the method of identifying a telecommunication fraud user of any of claims 1 to 7 via execution of the executable instructions.
CN202111583146.XA 2021-12-22 2021-12-22 Identification method of telecommunication fraud user and related product Pending CN116362760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111583146.XA CN116362760A (en) 2021-12-22 2021-12-22 Identification method of telecommunication fraud user and related product

Publications (1)

Publication Number Publication Date
CN116362760A true CN116362760A (en) 2023-06-30

Family

ID=86914281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111583146.XA Pending CN116362760A (en) 2021-12-22 2021-12-22 Identification method of telecommunication fraud user and related product

Country Status (1)

Country Link
CN (1) CN116362760A (en)

Similar Documents

Publication Publication Date Title
CN110298547A (en) Methods of risk assessment, device, computer installation and storage medium
CN108491720B (en) Application identification method, system and related equipment
CN108989581B (en) User risk identification method, device and system
CN113627566A (en) Early warning method and device for phishing and computer equipment
CN111510368B (en) Family group identification method, device, equipment and computer readable storage medium
CN111639690A (en) Fraud analysis method, system, medium, and apparatus based on relational graph learning
CN113688490A (en) Network co-construction sharing processing method, device, equipment and storage medium
CN112308173A (en) Multi-target object evaluation method based on multi-evaluation factor fusion and related equipment thereof
CN112950359B (en) User identification method and device
CN110807546A (en) Community grid population change early warning method and system
CN111401478B (en) Data anomaly identification method and device
CN110992230B (en) Full-scale demographic method, device and server based on terminal signaling data
CN116362760A (en) Identification method of telecommunication fraud user and related product
CN112764923B (en) Computing resource allocation method, computing resource allocation device, computer equipment and storage medium
CN114970495A (en) Name disambiguation method and device, electronic equipment and storage medium
CN111241297B (en) Atlas data processing method and apparatus based on label propagation algorithm
CN114238062A (en) Board card burning device performance analysis method, device, equipment and readable storage medium
CN113850669A (en) User grouping method and device, computer equipment and computer readable storage medium
CN115495570A (en) Application user classification method, application user evaluation method, application user classification device, application user evaluation device and application user evaluation equipment
CN112308694A (en) Method and device for discovering cheating group
CN112333708A (en) Telecommunication fraud detection method and system based on bidirectional gating circulation unit
CN112347441A (en) Power terminal identity authentication method and system based on trusted behavior sequence
CN111429257A (en) Transaction monitoring method and device
CN114630314B (en) Updating method, device, equipment and storage medium of terminal information base
CN109241428B (en) Method, device, server and storage medium for determining gender of user

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination