CN109284380B - Illegal user identification method and device based on big data analysis and electronic equipment - Google Patents

Illegal user identification method and device based on big data analysis and electronic equipment

Info

Publication number
CN109284380B
Authority
CN
China
Prior art keywords
user set
identified
users
clusters
legal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811120248.6A
Other languages
Chinese (zh)
Other versions
CN109284380A (en)
Inventor
孙家棣
马宁
于洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201811120248.6A
Priority to PCT/CN2018/125248 (WO2020062690A1)
Publication of CN109284380A
Application granted
Publication of CN109284380B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30: Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31: User authentication
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The disclosure relates to the technical field of big data, and provides an illegal user identification method and device based on big data analysis, electronic equipment and a computer readable storage medium, wherein the method comprises the following steps: acquiring effective feature data of a user set to be identified and a legal user set; clustering the effective feature data of the legal user set to determine the cluster number; clustering the effective feature data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters; and screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold value, and confirming that the users in the user set to be identified who were clustered into an abnormal cluster are illegal users. With this technical scheme, falsely registered users can be identified in batches by clustering, which improves identification efficiency; and because identification no longer relies on behavior feature matching, identification accuracy is improved.

Description

Illegal user identification method and device based on big data analysis and electronic equipment
Technical Field
The disclosure relates to the technical field of big data, in particular to an illegal user identification method and device based on big data analysis, electronic equipment and a computer readable storage medium.
Background
Currently, the popularity of smart terminals such as smartphones provides a carrier for all kinds of apps (applications). Large numbers of zombie accounts, and accounts that exist and stay active solely to inflate activity figures ("brushing"), are active across these apps. Both kinds are falsely registered users; their presence disrupts normal order on the network and wastes resources.
The traditional way to handle falsely registered users is to judge and delete them manually, which is inefficient. An existing alternative analyzes and summarizes the behavior characteristics of falsely registered users into a feature library, and then decides by behavior feature matching whether an unknown user is a false user. Because the feature library is limited in size and slow to update, a falsely registered user may evade the features recorded in the library and therefore cannot be accurately identified.
Disclosure of Invention
In order to solve the problem that false registered users cannot be accurately identified in the related art, the disclosure provides an illegal user identification method based on big data analysis.
In one aspect, the invention provides an illegal user identification method based on big data analysis, which comprises the following steps:
acquiring effective characteristic data of a user set to be identified and a legal user set;
clustering the effective characteristic data of the legal user set to determine the cluster number;
clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters;
and screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold value, and confirming that the users in the user set to be identified who were clustered into an abnormal cluster are illegal users.
Optionally, before the obtaining the valid feature data of the user set to be identified and the legal user set, the method further includes:
acquiring service data of a user set to be identified and a legal user set;
and extracting effective characteristics of the service data of the user set to be identified and the legal user set to obtain the effective characteristic data of the user set to be identified and the legal user set.
Optionally, the service data includes a plurality of feature variables, and the extracting the effective features of the service data of the to-be-identified user set and the legal user set to obtain the effective feature data of the to-be-identified user set and the legal user set includes:
and removing, from the feature variables of the user set to be identified and the legal user set, every feature variable whose value is the same for all users, the remaining feature variables, whose values differ, forming the effective feature data.
Optionally, the service data includes a plurality of feature variables, and the extracting effective features of the service data of the to-be-identified user set and the legal user set to obtain effective feature data of the to-be-identified user set and the legal user set, and further includes:
counting a first occurrence frequency of each variable value of the characteristic variable in a legal user set and a second occurrence frequency of each variable value in a user set to be identified;
if the difference between the first occurrence frequency and the second occurrence frequency is larger than a preset range, the characteristic variable belongs to effective characteristic data.
Optionally, the service data includes a plurality of feature variables, and the extracting effective features of the service data of the to-be-identified user set and the legal user set to obtain effective feature data of the to-be-identified user set and the legal user set, and further includes:
estimating the predicted frequency of each variable value in the user set to be identified according to the frequency of each variable value in the legal user set;
counting the real frequency number of the variable value in the user set to be identified, and if the real frequency number is larger than a predicted frequency number and the real frequency number is larger than a first preset value and the predicted frequency number is smaller than a second preset value, the characteristic variable belongs to effective characteristic data; wherein the first preset value is greater than the second preset value.
Optionally, screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold, and confirming that the users in the user set to be identified who were clustered into an abnormal cluster are illegal users, includes:
screening abnormal clusters from the plurality of clusters, wherein the abnormal clusters are clusters with the legal users in the plurality of clusters less than a preset threshold;
verifying whether the registration time of the user in the abnormal cluster and the residual storage space of the equipment show a negative correlation relationship or not;
and if the negative correlation relationship is presented, determining illegal users in the user set to be identified according to the users in the abnormal cluster.
Optionally, the determining, according to the users in the abnormal cluster, illegal users in the user set to be identified includes:
classifying the users with the same total storage space and the same equipment starting time into a class according to the total storage space and the equipment starting time of the users in the abnormal cluster;
and calculating the correlation coefficient of the registration time of each type of user and the residual storage space of the equipment respectively, and if the correlation coefficient meets the specified range, the users contained in the current category belong to illegal users, so that the illegal users in the user set to be identified are obtained.
On the other hand, the invention also provides an illegal user identification device based on big data analysis, which comprises:
the data acquisition module is used for acquiring effective characteristic data of the user set to be identified and the legal user set;
the cluster number determining module is used for clustering the effective characteristic data of the legal user set to determine the cluster number;
the user clustering module is used for clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of cluster clusters;
the abnormal cluster screening module is used for screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold value, and for confirming that the users in the user set to be identified who were clustered into an abnormal cluster are illegal users.
In addition, the invention also provides electronic equipment, which comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above-described big data analysis based illegal user identification method.
Furthermore, the invention also provides a computer readable storage medium storing a computer program executable by a processor to perform the illegal user identification method based on big data analysis.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
according to the technical scheme provided by the invention, clustering the effective feature data of the legal user set determines a suitable cluster number; the effective feature data of the user set to be identified and the legal user set are then clustered together according to that cluster number; a cluster containing few legal users can be regarded as an abnormal cluster; and the users to be identified who fall into an abnormal cluster can be regarded as illegal users. Falsely registered users can thus be identified in batches by clustering, which improves identification efficiency; and because identification no longer relies on behavior feature matching, identification accuracy is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present disclosure;
FIG. 2 is a block diagram of a server shown in accordance with an exemplary embodiment;
FIG. 3 is a flow chart illustrating a method of illegal user identification based on big data analysis, according to an exemplary embodiment;
FIG. 4 is a flow chart of an illegal user identification method based on big data analysis, which is shown in another exemplary embodiment based on the corresponding embodiment of FIG. 3;
FIG. 5 is a detailed flowchart of step 302 in the corresponding embodiment of FIG. 4;
FIG. 6 is a detailed flowchart of step 302 in the corresponding embodiment of FIG. 4, according to another embodiment;
FIG. 7 is a detailed flowchart of step 370 in the corresponding embodiment of FIG. 3;
FIG. 8 is a schematic diagram showing a negative correlation between registration time and remaining memory of a device;
FIG. 9 is a detailed flowchart of step 373 in the corresponding embodiment of FIG. 7;
FIG. 10 is a schematic diagram of registration time and device remaining memory space relationship for 4 devices performing a batch false registration;
FIG. 11 is a block diagram illustrating an illegal user identification device based on big data analysis according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Fig. 1 is a schematic diagram of an implementation environment in accordance with the present disclosure. The implementation environment comprises: a plurality of mobile terminals 110 and a server 120.
A plurality of mobile terminals 110 and a server 120 are connected by a wired or wireless network. Each mobile terminal 110 requests a user account registration from the server 120 by running the software APP. The server 120 may employ the scheme provided by the present invention to identify illegitimate users (including falsely registered users).
It should be noted that the illegal user identification method based on big data analysis provided by the present invention is not limited to the corresponding processing logic deployed in the server 120, but may be processing logic deployed in other machines. For example, the processing logic of the illegal user identification method of the present invention is deployed in a terminal device with computing capability.
Referring to fig. 2, fig. 2 is a schematic diagram of a server according to an embodiment of the present invention. The server 200 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 222 (e.g., one or more processors), memory 232, and one or more storage media 230 (e.g., one or more mass storage devices) storing applications 242 or data 244. The memory 232 and the storage medium 230 may be transitory or persistent storage. The program stored in the storage medium 230 may include one or more modules (not shown in the drawing), each of which may include a series of instruction operations for the server 200. Further, the central processor 222 may be configured to communicate with the storage medium 230 and execute, on the server 200, the series of instruction operations in the storage medium 230. The server 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc. The steps performed by the server described in the embodiments shown in fig. 3-7 and 9 below may be based on the server structure shown in fig. 2.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the embodiments described below may be implemented by hardware, or may be implemented by a program for instructing the relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Fig. 3 is a flow chart illustrating a method of illegal user identification based on big data analysis, according to an exemplary embodiment. The application range and the execution subject of the illegal user identification method based on big data analysis may be a server, and the server may be the server 120 of the implementation environment shown in fig. 1. As shown in fig. 3, the method may include the following steps.
In step 310, valid feature data for the set of users to be identified as well as the set of legitimate users is obtained.
An illegal user is a user who uses the APP (such as Jin Guangu APP) abnormally, as opposed to a legal user. An illegal user may be an account created through batch false registration by the black-market industry. The user set to be identified comprises a plurality of users whose legitimacy is unknown, i.e., it has not yet been determined whether they were falsely registered. The legal user set comprises a plurality of users confirmed to have registered and used the APP normally. The legal user set may be a set of whitelisted users such as formal business personnel, life insurance personnel, policy users, fund users, and the like. The effective feature data characterizes basic information of the user, such as location information, device information, registered phone number, registration time, etc.
In step 330, the valid feature data of the legal user set is clustered to determine the number of clusters.
Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. The cluster number is the number of such classes. Specifically, the k-means clustering algorithm can be used to cluster the effective feature data of the legal user set: the candidate cluster numbers are traversed (2 clusters, 3 clusters, 4 clusters, and so on), and for each candidate the total intra-cluster variation sum of the legal user set is calculated. The total intra-cluster variation sum is calculated as follows:
$$S = \sum_{i=1}^{m} \sum_{p \in C_i} d(p, c_i)^2$$
where $S$ represents the total intra-cluster variation sum; $m$ represents the number of clusters; $C_i$ is the set of samples assigned to cluster $i$; $p$ represents a sample instance in the legal user set; $c_i$ is the center of cluster $i$; and $d(x, y)$ represents the Euclidean distance between two points $x$ and $y$.
The cluster number with the minimum S value is taken as the most suitable cluster number. The total intra-cluster variation sum measures how tightly the members of every cluster are grouped: when it is smallest, the overall similarity within clusters is highest, i.e., similar users are gathered into the same cluster and dissimilar users fall into different clusters, so the cluster number reached at that point can be considered the most suitable.
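The selection of the cluster number described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`kmeans`, `total_intra_cluster_variation`, `choose_cluster_number`), the fixed random seed, and the use of squared Euclidean distance for S are all assumptions. Note that S generally shrinks as the cluster number grows, so the range of candidate cluster numbers passed in matters.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means; returns (centers, labels). Seeded for repeatability."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def total_intra_cluster_variation(points, centers, labels):
    """S: sum of squared distances from each point to its cluster center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def choose_cluster_number(points, candidates):
    """Return the candidate k with the smallest S, plus S for every candidate."""
    scores = {}
    for k in candidates:
        centers, labels = kmeans(points, k)
        scores[k] = total_intra_cluster_variation(points, centers, labels)
    best_k = min(scores, key=scores.get)
    return best_k, scores
```

For example, `choose_cluster_number(legal_points, range(2, 8))` would try 2 through 7 clusters on the legal user set, where `legal_points` is a hypothetical list of numeric feature tuples.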
In step 350, the valid feature data of the user set to be identified and the legal user set are clustered according to the cluster number, so as to obtain a plurality of clusters.
Specifically, a k-means clustering algorithm may be used to cluster the effective feature data of the user set to be identified and the legal user set according to the most suitable cluster number determined in step 330. For example, if the total intra-cluster variation sum is smallest when the cluster number is 4, all users can be divided into 4 clusters according to the effective feature data of the user set to be identified and of the legal user set. It should be noted that a cluster generated by clustering is a set of data objects that are similar to the other objects in the same cluster and different from the objects in other clusters.
In step 370, abnormal clusters are screened from the plurality of clusters, where an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold, and the users in the user set to be identified who were clustered into an abnormal cluster are confirmed to be illegal users.
It should be explained that users within one cluster are highly similar to each other, while users in different clusters differ substantially. Every cluster with no legal users, or with very few legal users (fewer than a preset threshold), is an abnormal cluster. That is, no legal user vouches for the legitimacy of such a cluster, so it is regarded as abnormal. Further, the users in the user set to be identified who were classified into an abnormal cluster may be regarded as illegal users.
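The screening of abnormal clusters in step 370 can be illustrated with a short sketch. The function names and the input layout (parallel lists of user ids, cluster labels, and legal-user flags) are assumptions made for this example; the threshold value is left to the caller because the text only calls it "preset".

```python
from collections import Counter

def find_abnormal_clusters(labels, is_legal, threshold):
    """Return ids of clusters whose legal-user count is below `threshold`."""
    legal_counts = Counter(lab for lab, legal in zip(labels, is_legal) if legal)
    # Counter returns 0 for clusters that contain no legal user at all.
    return {lab for lab in set(labels) if legal_counts[lab] < threshold}

def flag_illegal_users(user_ids, labels, is_legal, threshold):
    """Return the to-be-identified users that fell into an abnormal cluster."""
    abnormal = find_abnormal_clusters(labels, is_legal, threshold)
    return [uid for uid, lab, legal in zip(user_ids, labels, is_legal)
            if lab in abnormal and not legal]
```

With `threshold=1`, only clusters that contain no legal user at all are treated as abnormal; a larger threshold also catches clusters with very few legal users.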
According to the technical scheme provided by the exemplary embodiment of the invention, clustering the effective feature data of the legal user set determines a suitable cluster number; the effective feature data of the user set to be identified and the legal user set are then clustered together according to that cluster number; a cluster containing few legal users can be regarded as an abnormal cluster; and the users to be identified who fall into an abnormal cluster can be regarded as illegal users. Falsely registered users can thus be identified in batches by clustering, which improves identification efficiency; and because identification no longer relies on behavior feature matching, identification accuracy is improved.
In an exemplary embodiment, as shown in fig. 4, before the step 310, the illegal user identification method based on big data analysis provided by the present invention further includes the following steps:
in step 301, service data of a user set to be identified and a legal user set are acquired;
The service data includes a registered phone number, a registration time, SDK data (registered device information), and the like. The SDK data includes: the package name of the accessing app, the version number of the accessing app, the operating system version number, latitude and longitude information, the SIM (subscriber identity module) card serial number, the IMSI (international mobile subscriber identity), the IMEI (international mobile equipment identity), the MAC address of the device, and the like. Further, the service data may include data derived from the above, such as GPS latitude and longitude information, the carrier number segment of the mobile phone number (its first three digits), the 4th to 7th digits of the mobile phone number, whether the carrier attribution is consistent, whether the network type is Wi-Fi while the connected Wi-Fi name is empty, the first half of the IP address, and the battery level. Abnormal and missing data can be filtered out as required, and accounts of legal users are labeled 1 while all other accounts are labeled 0.
In step 302, effective feature extraction is performed on the service data of the user set to be identified and the legal user set, so as to obtain effective feature data of the user set to be identified and the legal user set.
The service data includes a plurality of data types; as described above, it includes registered mobile phone numbers, registration time, SDK data (registered device information), and the like. Not all service data can be used to characterize whether a user is legal, so the data categories that can do so need to be extracted from the service data as effective feature data.
In one embodiment, the service data includes a plurality of feature variables, and step 302 specifically includes: removing, from the feature variables of the user set to be identified and the legal user set, every feature variable whose value is the same for all users, the remaining feature variables, whose values differ, forming the effective feature data.
A feature variable is a data category: the package name of the accessing app can be regarded as one feature variable, the version number of the accessing app as another, and the operating system version number as yet another. The server can count the number of distinct values of each feature variable, keep the feature variables with more than one distinct value, and thereby filter out the feature variables whose value is the same everywhere. For example, if the accessing app version number is the same for all users, the feature variable "accessing app version number" may be removed, because it cannot be used to characterize whether a user is legal. In one embodiment, the remaining feature variables may be considered effective feature data.
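The removal of constant feature variables can be sketched in a few lines. The record layout (one dict per user, with the same keys in every dict) and the function name `drop_constant_features` are assumptions for illustration only.

```python
def drop_constant_features(records):
    """Return the names of feature variables that take more than one value.

    `records` is a list of dicts, one per user, all with the same keys.
    """
    if not records:
        return []
    return [name for name in records[0]
            if len({r[name] for r in records}) > 1]
```

Applied to the combined records of both user sets, a variable such as the app version number is dropped when every user reports the same version.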
In one embodiment, as shown in fig. 5, the step 302 may further include the following steps:
in step 501, counting a first occurrence frequency of each variable value of the characteristic variable in a legal user set and a second occurrence frequency in a user set to be identified;
It should be noted that if a feature variable can be used to characterize whether a user is legal, the frequency with which a given value of that variable occurs in the legal user set should differ substantially from the frequency with which it occurs in the user set to be identified. The first occurrence frequency is the number of times a given variable value occurs in the legal user set divided by the total number of records in that set. The second occurrence frequency is the number of times the variable value occurs in the user set to be identified divided by the total number of records in that set.
In step 502, if the difference between the first occurrence frequency and the second occurrence frequency is greater than a preset range, the feature variable belongs to valid feature data.
For example, if the first occurrence frequency of the registration time "aaaa" in the legal user set differs greatly from its second occurrence frequency in the user set to be identified, the feature variable "registration time" may be considered to belong to the effective feature data. In general, when the difference between the first and second occurrence frequencies exceeds a preset range, the variable value is distributed very differently in the two sets, and the feature variable is an effective feature variable.
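Steps 501 and 502 can be illustrated as follows. This sketch assumes that a feature variable qualifies as soon as any one of its values shows a frequency gap above the preset range; the text does not say whether one value suffices, so that rule, the function names, and the `min_gap` parameter are assumptions.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each distinct value to its count divided by the number of records."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def is_effective_by_frequency_gap(legal_values, unknown_values, min_gap):
    """True if some value's frequency differs by more than `min_gap`
    between the legal user set and the user set to be identified."""
    first = relative_frequencies(legal_values)     # first occurrence frequency
    second = relative_frequencies(unknown_values)  # second occurrence frequency
    return any(abs(first.get(v, 0.0) - second.get(v, 0.0)) > min_gap
               for v in set(first) | set(second))
```

Here `legal_values` and `unknown_values` hold the values of a single feature variable, one entry per user record.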
In another embodiment, as shown in fig. 6, the step 302 may include:
in step 601, estimating the predicted frequency of each variable value in the user set to be identified according to the frequency of occurrence of the variable value in the legal user set;
Specifically, the number of occurrences of each variable value of the feature variable in the legal user set is counted, and the predicted frequency in the user set to be identified is computed with the following formula:
$$y = \frac{x}{n}\,(N - n)$$
where $x$ represents the number of occurrences of a given variable value in the legal user set, $N$ represents the total number of records (the sum of the number of records in the user set to be identified and the number of records in the legal user set), $n$ represents the number of records in the legal user set, and $y$ represents the predicted frequency of the variable value in the user set to be identified.
That is, it is first assumed that the frequency of occurrence of the variable value in the legal user set is the same as the frequency of occurrence of the variable value in the user set to be identified, so that the frequency of occurrence of the variable value in the user set to be identified can be predicted according to the frequency of occurrence of the variable value in the legal user set.
In step 602, the real frequency number of the variable value in the user set to be identified is counted, if the real frequency number is greater than the predicted frequency number and the real frequency number is greater than the first preset value, and the predicted frequency number is less than the second preset value, the feature variable belongs to effective feature data; the first preset value is greater than the second preset value.
The real frequency is the number of occurrences of a given variable value counted in the user set to be identified, and may be denoted by z. In one embodiment, the first preset value may be 100 and the second preset value may be 10; both values may be adjusted empirically. The conditions defining an effective feature are:
$$\frac{z}{y} > 1, \quad z > 100, \quad x < 10$$
where $x$ represents the number of occurrences of a given variable value in the legal user set, $y$ represents the predicted frequency of the variable value in the user set to be identified, and $z$ represents the real frequency of the variable value in the user set to be identified. The condition $z/y > 1$ indicates that the real frequency is greater than the predicted frequency.
That is, the predicted frequency of a variable value in the user set to be identified can be estimated from the proportion of that variable value in the legal user set. If the ratio of the real frequency of the variable value in the user set to be identified to the predicted frequency is greater than 1, the real frequency is greater than 100, and the frequency of the variable value in the legal user set is less than 10, the feature variable to which the variable value belongs can be considered an effective feature. The effective feature data in the service data is screened out in this way.
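The screening rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `effective_values` and its parameters are hypothetical, and the proportional prediction y = x / n * (N - n) is an assumption drawn from the statement that the variable value is expected to occur at the same rate in both sets (the original formula is only available as a figure). The comparison z > y is used in place of z / y > 1 so that a predicted frequency of zero does not cause a division error.

```python
from collections import Counter


def effective_values(legal_values, pending_values,
                     first_preset=100, second_preset=10):
    """Return variable values satisfying z > y, z > first_preset, x < second_preset.

    legal_values:   values of one feature variable in the legal user set
    pending_values: values of the same variable in the user set to be identified
    """
    n = len(legal_values)        # records in the legal user set
    m = len(pending_values)      # records in the set to be identified (N - n)
    x_counts = Counter(legal_values)
    z_counts = Counter(pending_values)

    selected = []
    for value, z in z_counts.items():
        x = x_counts.get(value, 0)
        # Assumed proportional prediction: same rate in both sets.
        y = x / n * m if n else 0.0
        # z > y is equivalent to z / y > 1 whenever y > 0.
        if z > y and z > first_preset and x < second_preset:
            selected.append(value)
    return selected


legal = ["a"] * 995 + ["b"] * 5
pending = ["a"] * 300 + ["b"] * 200
result = effective_values(legal, pending)
```

In this example the value "b" occurs 5 times among 1000 legal records, so about 2.5 occurrences would be predicted among 500 records to be identified; it actually occurs 200 times, so the variable carrying it would be treated as an effective feature.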
In an exemplary embodiment, as shown in fig. 7, the step 370 specifically includes:
in step 371, an abnormal cluster is screened out from the plurality of clusters, where an abnormal cluster is a cluster, among the plurality of clusters, in which the number of legal users is smaller than a preset threshold.
In step 372, it is verified whether the registration time of the users in the abnormal cluster and the remaining storage space of their devices exhibit a negative correlation.
In step 373, if a negative correlation is exhibited, the illegal users in the user set to be identified are determined according to the users in the abnormal cluster.
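As a rough sketch of step 371, the following counts the legal users in each cluster and keeps the clusters whose count is below the threshold. The function name `abnormal_clusters`, the parallel-list input layout, and the example threshold are assumptions made for illustration; the source does not prescribe a data structure or a threshold value.

```python
from collections import defaultdict


def abnormal_clusters(labels, is_legal, threshold):
    """Return the cluster labels that contain fewer than `threshold` legal users.

    labels:   cluster label of each user (legal users and users to be identified)
    is_legal: True where the user comes from the legal user set
    """
    legal_per_cluster = defaultdict(int)
    for label, legal in zip(labels, is_legal):
        # Adding 0 still registers clusters that contain no legal users at all.
        legal_per_cluster[label] += int(legal)
    return sorted(c for c, count in legal_per_cluster.items() if count < threshold)


labels = [0, 0, 0, 1, 1, 2]
is_legal = [True, True, False, False, False, True]
found = abnormal_clusters(labels, is_legal, threshold=1)
```

Here cluster 1 contains no legal users and is therefore the only abnormal cluster at a threshold of 1.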
It should be noted that, under normal conditions, the remaining storage space of devices is evenly distributed over the course of a day and does not show a trend of gradually decreasing from morning to night. If the remaining storage space does decrease gradually, this can only be explained by batch false registration on a small number of devices: as the number of registered accounts increases, files are generated and stored on the devices, so their remaining storage space gradually decreases.
The negative correlation refers to a decreasing or approximately decreasing relationship between the registration time and the remaining storage space of the device; that is, the remaining storage space of the device decreases as the registration time increases. As shown in fig. 8, the horizontal axis represents the account registration time within a day and the vertical axis represents the remaining storage space of the device. If a black-market operation registers fake users in batches using several devices, diagonal line segments are obtained, as shown in fig. 8, in which the remaining storage space of a device gradually decreases as the registration time increases. Therefore, after the abnormal clusters are screened out, whether the users in the abnormal clusters are false users registered in batches can be determined by further verifying whether the registration time of the users in the abnormal clusters and the remaining storage space of the devices exhibit a negative correlation, and the falsely registered users in the user set to be identified can thereby be determined.
In one embodiment, as shown in fig. 9, the step 373 specifically includes:
in step 901, according to the total device storage space and the device start-up time of the users in the abnormal cluster, users with the same total storage space and the same device start-up time are grouped into one class;
wherein the total device storage space is the total storage space of the device used by a user in the abnormal cluster, and the device start-up time is the start-up time of the device used by a user in the abnormal cluster. For an abnormal cluster in which the registration time and the remaining storage space of the device exhibit a negative correlation, the users in the abnormal cluster are classified according to total device storage space and device start-up time, and users whose devices have the same total storage space and the same start-up time are grouped into one class. As shown in fig. 10, the users can be divided into 4 classes, which indicates that the offender used 4 devices to register user accounts in batches.
In step 902, the correlation coefficient between the registration time and the remaining storage space of the device is calculated separately for each class of users; if the correlation coefficient falls within the specified range, the users contained in the current class are illegal users, and the illegal users in the user set to be identified are thereby obtained.
Referring to the diagonal line segments shown in fig. 10, after the classification in step 901, it is calculated, for each class of users, whether the Pearson or Spearman correlation coefficient between the registration time and the remaining storage space of the device falls within the closed interval [-1, -0.9]. If it falls within this specified range, the users in that class are illegal users. In the production environment, about 60,000 to 80,000 new accounts are registered from iOS devices each day, and with the method provided by the invention about 20,000 to 30,000 false accounts can be identified.
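Steps 901 and 902 can be sketched as below, using the Pearson coefficient. This is an illustrative sketch under stated assumptions: the record keys (`user_id`, `total_storage`, `boot_time`, `reg_time`, `free_storage`) and the function names are hypothetical, and the treatment of classes with fewer than two users or with constant values (skipped, since no coefficient is defined) is a choice the source does not address.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, sxy / sqrt(sxx * syy)))


def flag_illegal_users(users, low=-1.0, high=-0.9):
    """Group users by device (total storage, start-up time) and flag each
    class whose coefficient lies in the closed interval [low, high]."""
    classes = defaultdict(list)
    for user in users:
        classes[(user["total_storage"], user["boot_time"])].append(user)

    flagged = []
    for members in classes.values():
        r = pearson([u["reg_time"] for u in members],
                    [u["free_storage"] for u in members])
        if r is not None and low <= r <= high:
            flagged.extend(u["user_id"] for u in members)
    return flagged


# Device A: free space falls steadily as accounts are registered.
device_a = [{"user_id": f"a{i}", "total_storage": 64, "boot_time": 1000,
             "reg_time": i, "free_storage": 100 - 2 * i} for i in range(1, 6)]
# Device B: no such trend.
device_b = [{"user_id": f"b{i}", "total_storage": 128, "boot_time": 2000,
             "reg_time": i, "free_storage": free}
            for i, free in enumerate([50, 70, 40, 65], start=1)]
flagged = flag_illegal_users(device_a + device_b)
```

A Spearman coefficient could be substituted by ranking both series before calling `pearson`, which would also catch monotonic but non-linear decreases.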
The following are apparatus embodiments of the present disclosure, which may be used to carry out the embodiments of the illegal user identification method based on big data analysis performed by the server 120 of the present disclosure. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the illegal user identification method based on big data analysis of the present disclosure.
Fig. 11 is a block diagram illustrating a big data analysis based illegal user recognition device that may be used in the server 120 of the implementation environment shown in fig. 1 to perform all or part of the steps of the big data analysis based illegal user recognition method shown in any one of fig. 3-7 and 9 according to an exemplary embodiment. As shown in fig. 11, the apparatus includes, but is not limited to: a data acquisition module 1110, a cluster number determination module 1130, a user clustering module 1150, and an abnormal cluster screening module 1170.
A data acquisition module 1110, configured to acquire valid feature data of a user set to be identified and a legal user set;
the cluster number determining module 1130 is configured to cluster the valid feature data of the legal user set to determine a cluster number;
a user clustering module 1150, configured to cluster the valid feature data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters;
the abnormal cluster screening module 1170 is configured to screen out an abnormal cluster from the plurality of clusters, where an abnormal cluster is a cluster, among the plurality of clusters, in which the number of legal users is smaller than a preset threshold, and to confirm that the users in the user set to be identified that are clustered into the abnormal cluster are illegal users.
The implementation process of the functions and actions of each module in the device is specifically detailed in the implementation process of corresponding steps in the illegal user identification method based on big data analysis, and is not repeated here.
The data acquisition module 1110 may be, for example, the physical wired or wireless network interface 250 of fig. 2.
The cluster number determining module 1130, the user clustering module 1150, and the abnormal cluster screening module 1170 may also be functional modules configured to perform corresponding steps in the illegal user identification method based on big data analysis. It is to be understood that these modules may be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, the modules may be implemented as one or more computer programs executing on one or more processors, such as the program stored in memory 232 executed by central processor 222 of fig. 2.
Optionally, the present disclosure further provides an electronic device, which may be used in the server 120 of the implementation environment shown in fig. 1, to perform all or part of the steps of the illegal user identification method based on big data analysis shown in any of fig. 3-7 and 9. The electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the illegal user identification method based on big data analysis according to the above-described exemplary embodiment.
The specific manner in which the processor of the electronic device performs the operations in this embodiment has been described in detail in relation to this embodiment of the big data analysis based illegal user identification method, and will not be described in detail here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions. The storage medium stores a computer program executable by the central processor 222 of the server 200 to perform the above-described illegal user identification method based on big data analysis.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (9)

1. An illegal user identification method based on big data analysis is characterized by comprising the following steps:
acquiring effective characteristic data of a user set to be identified and a legal user set;
clustering the effective characteristic data of the legal user set to determine the cluster number;
clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters;
screening abnormal clusters from the plurality of clusters, wherein the abnormal clusters are clusters, among the plurality of clusters, in which the number of legal users is less than a preset threshold;
verifying whether the registration time of the user in the abnormal cluster and the residual storage space of the equipment show a negative correlation relationship or not;
and if the negative correlation relationship is presented, determining illegal users in the user set to be identified according to the users in the abnormal cluster.
2. The method of claim 1, wherein before the acquiring of the effective characteristic data of the user set to be identified and the legal user set, the method further comprises:
acquiring service data of a user set to be identified and a legal user set;
and extracting effective characteristics of the service data of the user set to be identified and the legal user set to obtain the effective characteristic data of the user set to be identified and the legal user set.
3. The method according to claim 2, wherein the service data includes a plurality of feature variables, the extracting effective features of the service data of the to-be-identified user set and the legal user set to obtain effective feature data of the to-be-identified user set and the legal user set includes:
and removing, from the characteristic variables of the user set to be identified and the legal user set, the characteristic variables whose variable values are all the same, the remaining characteristic variables, whose variable values differ, forming the effective characteristic data.
4. The method according to claim 2, wherein the service data includes a plurality of feature variables, the effective feature extraction is performed on the service data of the to-be-identified user set and the legal user set to obtain the effective feature data of the to-be-identified user set and the legal user set, and the method further includes:
counting a first occurrence frequency of each variable value of the characteristic variable in a legal user set and a second occurrence frequency of each variable value in a user set to be identified;
if the difference between the first occurrence frequency and the second occurrence frequency is larger than a preset range, the characteristic variable belongs to effective characteristic data.
5. The method according to claim 2, wherein the service data includes a plurality of feature variables, the effective feature extraction is performed on the service data of the to-be-identified user set and the legal user set to obtain the effective feature data of the to-be-identified user set and the legal user set, and the method further includes:
estimating the predicted frequency of each variable value in the user set to be identified according to the frequency of each variable value in the legal user set;
counting the real frequency number of the variable value in the user set to be identified, and if the real frequency number is larger than a predicted frequency number and the real frequency number is larger than a first preset value and the predicted frequency number is smaller than a second preset value, the characteristic variable belongs to effective characteristic data; wherein the first preset value is greater than the second preset value.
6. The method according to claim 1, wherein the determining, according to the users in the abnormal cluster, illegal users in the set of users to be identified includes:
classifying the users with the same total storage space and the same equipment starting time into a class according to the total storage space and the equipment starting time of the users in the abnormal cluster;
and calculating the correlation coefficient of the registration time of each type of user and the residual storage space of the equipment respectively, and if the correlation coefficient meets the specified range, the users contained in the current category belong to illegal users, so that the illegal users in the user set to be identified are obtained.
7. An illegal user recognition device based on big data analysis, comprising:
the data acquisition module is used for acquiring effective characteristic data of the user set to be identified and the legal user set;
the cluster number determining module is used for clustering the effective characteristic data of the legal user set to determine the cluster number;
the user clustering module is used for clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters;
the abnormal cluster screening module is used for screening abnormal clusters from the plurality of clusters, wherein the abnormal clusters are clusters, among the plurality of clusters, in which the number of legal users is less than a preset threshold, and for confirming that the users in the user set to be identified that are clustered into the abnormal clusters are illegal users;
the abnormal cluster screening module comprises:
verifying whether the registration time of the user in the abnormal cluster and the residual storage space of the equipment show a negative correlation relationship or not;
and if the negative correlation relationship is presented, determining illegal users in the user set to be identified according to the users in the abnormal cluster.
8. An electronic device, the electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the big data analysis based illegal user identification method of any of claims 1-6.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program executable by a processor to perform the big data analysis based illegal user identification method according to any of the claims 1-6.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811120248.6A CN109284380B (en) 2018-09-25 2018-09-25 Illegal user identification method and device based on big data analysis and electronic equipment
PCT/CN2018/125248 WO2020062690A1 (en) 2018-09-25 2018-12-29 Method and apparatus for illegal user identification based on big data analysis, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811120248.6A CN109284380B (en) 2018-09-25 2018-09-25 Illegal user identification method and device based on big data analysis and electronic equipment

Publications (2)

Publication Number Publication Date
CN109284380A CN109284380A (en) 2019-01-29
CN109284380B true CN109284380B (en) 2023-04-25

Family

ID=65182106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811120248.6A Active CN109284380B (en) 2018-09-25 2018-09-25 Illegal user identification method and device based on big data analysis and electronic equipment

Country Status (2)

Country Link
CN (1) CN109284380B (en)
WO (1) WO2020062690A1 (en)

Also Published As

Publication number Publication date
CN109284380A (en) 2019-01-29
WO2020062690A1 (en) 2020-04-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant