CN109284380B - Illegal user identification method and device based on big data analysis and electronic equipment - Google Patents
- Publication number
- CN109284380B; CN201811120248.6A
- Authority
- CN
- China
- Prior art keywords
- user set
- identified
- users
- clusters
- legal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Debugging And Monitoring (AREA)
Abstract
The disclosure relates to the technical field of big data, and provides an illegal user identification method and device based on big data analysis, electronic equipment and a computer readable storage medium. The method comprises the following steps: acquiring effective feature data of a user set to be identified and of a legal user set; clustering the effective feature data of the legal user set to determine the cluster number; clustering the effective feature data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters; and screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold, and confirming that the users in the user set to be identified who are clustered into the abnormal clusters are illegal users. With this technical scheme, falsely registered users can be identified in batches by clustering, which improves identification efficiency, and because identification does not rely on behavior feature matching, identification accuracy is improved.
Description
Technical Field
The disclosure relates to the technical field of big data, in particular to an illegal user identification method and device based on big data analysis, electronic equipment and a computer readable storage medium.
Background
Currently, the popularity of smart terminals such as smartphones provides a carrier for various types of APP (Application). Large numbers of dead (zombie) users, as well as large numbers of users that exist and are active solely for brushing (artificially inflating orders or traffic), are rampant on various APPs. Both kinds are falsely registered users, whose presence on the one hand disrupts the normal order of the network and on the other hand wastes resources.
For such falsely registered users, the traditional approach is to judge and delete false users manually, which is inefficient. An existing approach analyzes and summarizes the behavior features of falsely registered users to build a feature library, and then determines whether an unknown user is a false user by behavior feature matching. Because the feature library is limited in size and slow to update, a falsely registered user may evade the features recorded in the library and therefore cannot be accurately identified.
Disclosure of Invention
In order to solve the problem that false registered users cannot be accurately identified in the related art, the disclosure provides an illegal user identification method based on big data analysis.
In one aspect, the invention provides an illegal user identification method based on big data analysis, which comprises the following steps:
acquiring effective characteristic data of a user set to be identified and a legal user set;
clustering the effective characteristic data of the legal user set to determine the cluster number;
clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters;
and screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold, and confirming that the users in the user set to be identified who are clustered into the abnormal clusters are illegal users.
Optionally, before the obtaining the valid feature data of the user set to be identified and the legal user set, the method further includes:
acquiring service data of a user set to be identified and a legal user set;
and extracting effective characteristics of the service data of the user set to be identified and the legal user set to obtain the effective characteristic data of the user set to be identified and the legal user set.
Optionally, the service data includes a plurality of feature variables, and the extracting the effective features of the service data of the to-be-identified user set and the legal user set to obtain the effective feature data of the to-be-identified user set and the legal user set includes:
and removing, from the feature variables of the user set to be identified and the legal user set, the feature variables whose variable value is the same for all users, the remaining feature variables, whose variable values differ, forming the effective feature data.
Optionally, the service data includes a plurality of feature variables, and the extracting effective features of the service data of the to-be-identified user set and the legal user set to obtain effective feature data of the to-be-identified user set and the legal user set, and further includes:
counting a first occurrence frequency of each variable value of the characteristic variable in a legal user set and a second occurrence frequency of each variable value in a user set to be identified;
if the difference between the first occurrence frequency and the second occurrence frequency is larger than a preset range, the characteristic variable belongs to effective characteristic data.
Optionally, the service data includes a plurality of feature variables, and the extracting effective features of the service data of the to-be-identified user set and the legal user set to obtain effective feature data of the to-be-identified user set and the legal user set, and further includes:
estimating the predicted frequency of each variable value in the user set to be identified according to the frequency of each variable value in the legal user set;
counting the real frequency number of the variable value in the user set to be identified, and if the real frequency number is larger than a predicted frequency number and the real frequency number is larger than a first preset value and the predicted frequency number is smaller than a second preset value, the characteristic variable belongs to effective characteristic data; wherein the first preset value is greater than the second preset value.
Optionally, the screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold, and confirming that the users in the user set to be identified who are clustered into the abnormal clusters are illegal users, includes:
screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold;
verifying whether the registration time of the user in the abnormal cluster and the residual storage space of the equipment show a negative correlation relationship or not;
and if the negative correlation relationship is presented, determining illegal users in the user set to be identified according to the users in the abnormal cluster.
Optionally, the determining, according to the users in the abnormal cluster, illegal users in the user set to be identified includes:
classifying the users with the same total storage space and the same equipment starting time into a class according to the total storage space and the equipment starting time of the users in the abnormal cluster;
and calculating the correlation coefficient of the registration time of each type of user and the residual storage space of the equipment respectively, and if the correlation coefficient meets the specified range, the users contained in the current category belong to illegal users, so that the illegal users in the user set to be identified are obtained.
On the other hand, the invention also provides an illegal user identification device based on big data analysis, which comprises:
the data acquisition module is used for acquiring effective characteristic data of the user set to be identified and the legal user set;
the cluster number determining module is used for clustering the effective characteristic data of the legal user set to determine the cluster number;
the user clustering module is used for clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of cluster clusters;
the abnormal cluster screening module is used for screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold, and for confirming that the users in the user set to be identified who are clustered into the abnormal clusters are illegal users.
In addition, the invention also provides electronic equipment, which comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above-described big data analysis based illegal user identification method.
Furthermore, the invention also provides a computer readable storage medium storing a computer program executable by a processor to perform the illegal user identification method based on big data analysis.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
according to the technical scheme provided by the invention, the effective feature data of the legal user set is clustered so that a suitable cluster number can be determined; the effective feature data of the user set to be identified and the legal user set are then clustered according to that cluster number; a cluster containing few legal users can be regarded as an abnormal cluster; and the users to be identified that fall into an abnormal cluster can be regarded as illegal users. With this technical scheme, falsely registered users can be identified in batches by clustering, which improves identification efficiency, and because identification does not rely on behavior feature matching, identification accuracy is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present disclosure;
FIG. 2 is a block diagram of a server shown in accordance with an exemplary embodiment;
FIG. 3 is a flow chart illustrating a method of illegal user identification based on big data analysis, according to an exemplary embodiment;
FIG. 4 is a flow chart of an illegal user identification method based on big data analysis, which is shown in another exemplary embodiment based on the corresponding embodiment of FIG. 3;
FIG. 5 is a detailed flowchart of step 302 in the corresponding embodiment of FIG. 4;
FIG. 6 is a detailed flow chart of step 302 in the corresponding embodiment of FIG. 4;
FIG. 7 is a detailed flowchart of step 370 in the corresponding embodiment of FIG. 3;
FIG. 8 is a schematic diagram showing a negative correlation between registration time and remaining memory of a device;
FIG. 9 is a detailed flowchart of step 373 in the corresponding embodiment of FIG. 7;
FIG. 10 is a schematic diagram of registration time and device remaining memory space relationship for 4 devices performing a batch false registration;
FIG. 11 is a block diagram illustrating an illegal user identification device based on big data analysis according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Fig. 1 is a schematic diagram of an implementation environment in accordance with the present disclosure. The implementation environment comprises: a plurality of mobile terminals 110 and a server 120.
A plurality of mobile terminals 110 and a server 120 are connected by a wired or wireless network. Each mobile terminal 110 requests a user account registration from the server 120 by running the software APP. The server 120 may employ the scheme provided by the present invention to identify illegitimate users (including falsely registered users).
It should be noted that the illegal user identification method based on big data analysis provided by the present invention is not limited to the corresponding processing logic deployed in the server 120, but may be processing logic deployed in other machines. For example, the processing logic of the illegal user identification method of the present invention is deployed in a terminal device with computing capability.
Referring to fig. 2, fig. 2 is a schematic diagram of a server according to an embodiment of the present invention. The server 200 may vary considerably in configuration or performance and may include one or more central processing units (CPU) 222 (e.g., one or more processors), memory 232, and one or more storage media 230 (e.g., one or more mass storage devices) storing applications 242 or data 244. The memory 232 and storage medium 230 may be transitory or persistent storage. The program stored in the storage medium 230 may include one or more modules (not shown in the drawing), each of which may include a series of instruction operations for the server 200. Further, the central processor 222 may be configured to communicate with the storage medium 230 and execute, on the server 200, the series of instruction operations in the storage medium 230. The server 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc. The steps performed by the server described in the embodiments shown in figs. 3-7 and 9 below may be based on the server structure shown in fig. 2.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the embodiments described below may be implemented by hardware, or may be implemented by a program for instructing the relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Fig. 3 is a flow chart illustrating a method of illegal user identification based on big data analysis, according to an exemplary embodiment. The application range and the execution subject of the illegal user identification method based on big data analysis may be a server, and the server may be the server 120 of the implementation environment shown in fig. 1. As shown in fig. 3, the method may include the following steps.
In step 310, valid feature data for the set of users to be identified as well as the set of legitimate users is obtained.
Here, illegal users are users who use the APP (such as Jin Guangu APP) abnormally, as opposed to legal users. An illegal user may be a user generated by batch false registration carried out by the black-market industry. The user set to be identified comprises a plurality of users whose legality is unknown, i.e., it has not yet been determined whether these users were falsely registered. The legal user set is a plurality of users confirmed to have registered and to use the APP normally. The legal user set may be a set of whitelisted users such as formal business personnel, life insurance personnel, policy users, fund users, and the like. The effective feature data characterizes basic information of the user, such as location information, device information, registered phone number, registration time, etc.
In step 330, the valid feature data of the legal user set is clustered to determine the number of clusters.
Clustering is the process of dividing a collection of physical or abstract objects into classes consisting of similar objects. The cluster number is the number of such classes. Specifically, the k-means clustering algorithm can be used to cluster the effective feature data of the legal user set: the cluster number is traversed, i.e., clustering into 2 classes, 3 classes, 4 classes and so on is tried, and the total intra-cluster variation sum of the legal user set is calculated for each cluster number. The total intra-cluster variation sum is calculated as follows:
$$S = \sum_{i=1}^{m} \sum_{p \in C_i} d(p, c_i)^2$$
where S represents the total intra-cluster variation sum; m represents the number of clusters; C_i is cluster i; p represents a sample instance of the legal user set belonging to cluster i; c_i is the center of cluster i; and d(x, y) represents the Euclidean distance between two points x and y.
The cluster number with the minimum S value is taken as the most suitable cluster number. The total intra-cluster variation sum measures, summed over all clusters, how far the users in a cluster lie from its center; when it is smallest, the similarity within the clusters is highest, that is, similar users are gathered in the same cluster and dissimilar users fall in different clusters, and the cluster number at that point can be considered the most suitable.
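The traversal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes S is the sum of squared Euclidean distances to the cluster centers, uses a plain Lloyd-style k-means written with the standard library, and the function names (`kmeans`, `variation_by_k`, `pick_cluster_number`) are hypothetical. Because S tends to shrink as the cluster number grows, the sketch only compares an explicitly bounded list of candidate cluster numbers; how the candidate range is bounded is not stated in the text.

```python
import math
import random

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def total_within_variation(points, labels, centers):
    """S: sum of squared Euclidean distances from each point to its cluster center."""
    return sum(math.dist(p, centers[c]) ** 2 for p, c in zip(points, labels))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the old center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers

def variation_by_k(points, candidate_ks):
    """Compute S for every candidate cluster number."""
    result = {}
    for k in candidate_ks:
        labels, centers = kmeans(points, k)
        result[k] = total_within_variation(points, labels, centers)
    return result

def pick_cluster_number(points, candidate_ks):
    """Return the candidate cluster number with the smallest S."""
    scores = variation_by_k(points, candidate_ks)
    return min(scores, key=scores.get)

legit_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(variation_by_k(legit_points, [1, 2]))
```

In practice each point would be a numeric encoding of one legal user's effective feature data; the encoding itself is outside the scope of this sketch.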
In step 350, the valid feature data of the user set to be identified and the legal user set are clustered according to the cluster number, so as to obtain a plurality of clusters.
Specifically, a k-means clustering algorithm may be used to cluster the effective feature data of the user set to be identified and the legal user set according to the most suitable cluster number determined in step 330. For example, if the total intra-cluster variation sum is smallest when the cluster number is 4, all users can be divided into 4 clusters according to the effective feature data of the user set to be identified and of the legal user set. It should be noted that a cluster generated by clustering is a set of data objects that are similar to the other objects in the same cluster and different from the objects in other clusters.
In step 370, abnormal clusters are screened from the plurality of clusters, where an abnormal cluster is a cluster in which the number of legal users is smaller than a preset threshold, and the users in the user set to be identified who are clustered into the abnormal clusters are confirmed to be illegal users.
It should be explained that users within a cluster are highly similar to each other, while users in different clusters differ considerably. Every cluster that contains no legal users, or very few legal users (fewer than a preset threshold), is an abnormal cluster. That is, no legal user vouches for the legality of such a cluster, so it is considered abnormal. The users of the user set to be identified that are classified into an abnormal cluster may then be regarded as illegal users.
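The screening in step 370 can be sketched as below. This is an illustrative sketch under stated assumptions: cluster labels are integers from 0 to the cluster number minus 1, a boolean flag marks legal users, and the function names and the threshold value are hypothetical rather than taken from the patent.

```python
from collections import Counter

def find_abnormal_clusters(labels, is_legit, num_clusters, threshold):
    """Return the clusters whose count of legal users is below the threshold."""
    legit_counts = Counter(l for l, legit in zip(labels, is_legit) if legit)
    return {c for c in range(num_clusters) if legit_counts[c] < threshold}

def flag_illegal_users(user_ids, labels, is_legit, num_clusters, threshold):
    """Return the users to be identified that were clustered into an abnormal cluster."""
    abnormal = find_abnormal_clusters(labels, is_legit, num_clusters, threshold)
    return [u for u, l, legit in zip(user_ids, labels, is_legit)
            if l in abnormal and not legit]

user_ids = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 1, 1, 1]                        # cluster assigned to each user
is_legit = [True, True, False, False, False, False]  # True for the legal user set
print(flag_illegal_users(user_ids, labels, is_legit, num_clusters=2, threshold=1))
```

Here cluster 1 contains no legal users, so the three users to be identified in it are flagged, while user "c" shares a cluster with legal users and is not.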
According to the technical scheme provided by the exemplary embodiment of the invention, the effective feature data of the legal user set is clustered so that a suitable cluster number can be determined; the effective feature data of the user set to be identified and the legal user set are then clustered according to that cluster number; a cluster containing few legal users can be regarded as an abnormal cluster; and the users to be identified that fall into an abnormal cluster can be regarded as illegal users. With this technical scheme, falsely registered users can be identified in batches by clustering, which improves identification efficiency, and because identification does not rely on behavior feature matching, identification accuracy is improved.
In an exemplary embodiment, as shown in fig. 4, before the step 310, the illegal user identification method based on big data analysis provided by the present invention further includes the following steps:
in step 301, service data of a user set to be identified and a legal user set are acquired;
The service data includes a registered phone number, a registration time, SDK data (registered device information), and the like. The SDK data includes: the package name of the accessing App, the version number of the accessing App, the operating system version number, latitude and longitude information, a SIM (subscriber identity module) card serial number, an IMSI (international mobile subscriber identity), an IMEI (international mobile equipment identity), the MAC address of the device, and the like. Further, the service data may include data derived from the above, such as latitude and longitude information from GPS data, the carrier number segment of the mobile phone number (its first three digits), the 4th to 7th digits of the mobile phone number, whether the carrier's home location is consistent, whether the network type is wifi while the name of the connected wifi is empty, the first half of the IP address, and the battery power level. Abnormal data and missing data can be filtered out as required. Legal user accounts are labeled 1 and all other accounts are labeled 0.
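A few of the derived fields listed above can be computed as in the sketch below. The record layout and every field name (`phone`, `ip`, `network_type`, `wifi_name`, and the output keys) are hypothetical, and the IP address is assumed to be a dotted IPv4 string; the text does not specify the actual schema.

```python
def derive_features(record):
    """Derive illustrative fields from one registration record (a dict)."""
    phone = record["phone"]
    ip_parts = record["ip"].split(".")
    return {
        # first three digits of the phone number (carrier number segment)
        "carrier_prefix": phone[:3],
        # 4th to 7th digits of the phone number
        "phone_digits_4_7": phone[3:7],
        # first half of a dotted IPv4 address
        "ip_first_half": ".".join(ip_parts[:2]),
        # network type is wifi but the connected wifi name is empty
        "wifi_name_missing": record["network_type"] == "wifi"
                             and not record.get("wifi_name"),
    }

sample = {"phone": "13812345678", "ip": "192.168.1.20",
          "network_type": "wifi", "wifi_name": ""}
print(derive_features(sample))
```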
In step 302, effective feature extraction is performed on the service data of the user set to be identified and the legal user set, so as to obtain effective feature data of the user set to be identified and the legal user set.
The service data includes a plurality of data categories; as described above, it includes registered mobile phone numbers, registration time, SDK data (registered device information), and the like. Not all service data can be used to characterize whether a user is legal, so the categories of data that can do so need to be extracted from the service data as effective feature data.
In one embodiment, the service data includes a plurality of feature variables, and step 302 specifically includes: removing, from the feature variables of the user set to be identified and the legal user set, the feature variables whose variable value is the same for all users, the remaining feature variables, whose variable values differ, forming the effective feature data.
A feature variable is a data category: the package name of the accessing App can be regarded as one feature variable, the version number of the accessing App as another, and the operating system version number as yet another. The server can count the number of distinct values of each feature variable and mark the feature variables that have more than one distinct value, thereby filtering out the feature variables whose value is the same for every user. For example, if the accessing App version number is the same for all users, the feature variable "accessing App version number" may be removed, because it cannot be used to characterize whether a user is legal. In one embodiment, the remaining feature variables may be considered effective feature data.
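A minimal sketch of this filtering step, assuming each user is represented as a dict that maps hypothetical feature variable names to hashable values:

```python
def non_constant_features(records):
    """Return the feature variables that take more than one distinct value."""
    if not records:
        return []
    return [name for name in records[0]
            if len({r.get(name) for r in records}) > 1]

records = [
    {"app_version": "1.0", "os_version": "9"},
    {"app_version": "1.0", "os_version": "10"},
    {"app_version": "1.0", "os_version": "9"},
]
print(non_constant_features(records))
```

The records passed in would be the union of the user set to be identified and the legal user set, so that a variable is dropped only when it is constant across both.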
In one embodiment, as shown in fig. 5, the step 302 may further include the following steps:
in step 501, counting a first occurrence frequency of each variable value of the characteristic variable in a legal user set and a second occurrence frequency in a user set to be identified;
It should be noted that if a feature variable can be used to characterize whether a user is legal, there should be a large difference between the frequency with which a given value of that variable occurs in the legal user set and the frequency with which it occurs in the user set to be identified. The first occurrence frequency is the number of occurrences of a given variable value in the legal user set divided by the total number of data records in that set. The second occurrence frequency is the number of occurrences of the variable value in the user set to be identified divided by the total number of data records in that set.
In step 502, if the difference between the first occurrence frequency and the second occurrence frequency is greater than a preset range, the feature variable belongs to valid feature data.
For example, if the first occurrence frequency of the registration time "aaaa" in the legal user set differs greatly from its second occurrence frequency in the user set to be identified, the feature variable "registration time" may be considered to belong to the effective feature data. When the difference between the first and second occurrence frequencies exceeds the preset range, the occurrence frequencies in the two sets are considered to differ substantially, and the feature variable is an effective feature variable.
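Steps 501 and 502 can be sketched as follows. The sketch assumes that a feature variable counts as effective as soon as any one of its values shows a frequency gap above the threshold; the text does not say whether one value suffices. The function names and the threshold `min_gap` are hypothetical.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each value to its count divided by the number of records."""
    total = len(values)
    if total == 0:
        return {}
    return {v: c / total for v, c in Counter(values).items()}

def is_effective_by_frequency_gap(legit_values, unknown_values, min_gap):
    """True if some value's frequency differs by more than min_gap between the sets."""
    first = relative_frequencies(legit_values)     # first occurrence frequency
    second = relative_frequencies(unknown_values)  # second occurrence frequency
    return any(abs(first.get(v, 0.0) - second.get(v, 0.0)) > min_gap
               for v in set(first) | set(second))

legit_hours = ["day"] * 9 + ["night"]
unknown_hours = ["day"] * 2 + ["night"] * 8
print(is_effective_by_frequency_gap(legit_hours, unknown_hours, min_gap=0.3))
```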
In another embodiment, as shown in fig. 6, the step 302 may include:
in step 601, estimating the predicted frequency of each variable value in the user set to be identified according to the frequency of occurrence of the variable value in the legal user set;
Specifically, the number of occurrences of each variable value of the feature variable in the legal user set is counted, and the predicted frequency is calculated using the following formula:
$$y = \frac{x}{n}\,(N - n)$$
where x represents the number of occurrences of a certain variable value in the legal user set, N represents the total number of data records (the sum of the number of data records of the user set to be identified and the number of data records of the legal user set), n represents the number of data records of the legal user set, and y represents the predicted frequency of the variable value in the user set to be identified.
That is, it is first assumed that the frequency of occurrence of the variable value in the legal user set is the same as the frequency of occurrence of the variable value in the user set to be identified, so that the frequency of occurrence of the variable value in the user set to be identified can be predicted according to the frequency of occurrence of the variable value in the legal user set.
In step 602, the real frequency number of the variable value in the user set to be identified is counted, if the real frequency number is greater than the predicted frequency number and the real frequency number is greater than the first preset value, and the predicted frequency number is less than the second preset value, the feature variable belongs to effective feature data; the first preset value is greater than the second preset value.
The real frequency is the number of occurrences of a variable value obtained by counting in the user set to be identified, and may be denoted by z. In one embodiment, the first preset value may be 100 and the second preset value may be 10; the values 10 and 100 may be adjusted empirically. The conditions defining an effective feature are: $z / y > 1$ and $z > 100$ and $x < 10$, where x represents the number of occurrences of a variable value in the legal user set, y represents the predicted frequency of the variable value in the user set to be identified, and z represents the real frequency of the variable value in the user set to be identified. $z / y > 1$ indicates that the real frequency is greater than the predicted frequency.
That is, the predicted frequency of a variable value in the user set to be identified can be derived from the proportion of that value in the legal user set. If the ratio of the real frequency of the variable value in the user set to be identified to its predicted frequency is greater than 1, the real frequency is greater than 100, and the number of occurrences of the variable value in the legal user set is less than 10, the feature variable to which the value belongs can be considered an effective feature. The effective feature data in the service data is screened out in this way.
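Steps 601 and 602 can be sketched as below, following the worked condition (real count above the predicted count, real count above 100, legal-set count below 10). The prediction assumes the value occurs at the same rate in both sets, i.e. the legal-set count scaled by the ratio of the two set sizes; the function names are hypothetical, and the comparison is written as z > y rather than z / y > 1 to avoid dividing by zero when a value never occurs in the legal user set.

```python
from collections import Counter

def predicted_count(x, n_legit, n_total):
    """Predicted count y in the set to be identified, assuming equal rates in both sets."""
    return x * (n_total - n_legit) / n_legit

def effective_values(legit_values, unknown_values, first_preset=100, second_preset=10):
    """Return the variable values that satisfy z > y, z > first_preset, x < second_preset."""
    n_legit = len(legit_values)
    n_total = n_legit + len(unknown_values)
    legit_counts = Counter(legit_values)
    hits = []
    for value, z in Counter(unknown_values).items():  # z: real count
        x = legit_counts.get(value, 0)                # x: count in the legal user set
        y = predicted_count(x, n_legit, n_total)      # y: predicted count
        if z > y and z > first_preset and x < second_preset:
            hits.append(value)
    return hits

legit_wifi = ["named"] * 995 + ["empty"] * 5
unknown_wifi = ["named"] * 300 + ["empty"] * 200
print(effective_values(legit_wifi, unknown_wifi))
```

A feature variable with at least one such value would then be kept as effective feature data.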
In an exemplary embodiment, as shown in fig. 7, the step 370 specifically includes:
in step 371, an abnormal cluster is selected from the plurality of clusters, where the abnormal cluster is a cluster in which the number of legal users in the plurality of clusters is smaller than a preset threshold.
In step 372, it is verified whether the registration time of the user in the abnormal cluster and the remaining storage space of the device exhibit a negative correlation.
In step 373, if the negative correlation relationship is presented, determining, according to the users in the abnormal cluster, illegal users in the user set to be identified.
It should be noted that, under normal conditions, the remaining storage space of the devices registering within one day is evenly distributed and does not show a pattern of gradually decreasing from morning to night. If the remaining storage space does decrease gradually, this can only be explained by false registrations made in batches on a small number of devices: as the number of registered accounts increases, files are generated and stored on each device, so its remaining storage space gradually decreases.
Here, the negative correlation refers to a relationship in which the remaining storage space of the device decreases, or nearly always decreases, as the registration time increases. As shown in fig. 8, the horizontal axis represents the account registration time within a day and the vertical axis represents the remaining storage space of the device. If a black-market (illicit) operation registers fake users in batches using several devices, diagonal line segments are obtained in which the remaining storage space of each device gradually decreases as the registration time increases, as shown in fig. 8. Therefore, after the abnormal clusters have been screened, whether the users in an abnormal cluster are false users registered in batches can be determined by further verifying whether their registration time and the remaining storage space of their devices are negatively correlated, and the falsely registered users in the user set to be identified can then be determined.
In one embodiment, as shown in fig. 9, the step 373 specifically includes:
In step 901, the users in the abnormal cluster are classified according to the total storage space and start-up time of their devices, users with the same total storage space and the same device start-up time being placed in one class.
Here, the total storage space is the total storage space of the device used by a user in the abnormal cluster, and the device start-up time is the start-up time of that device. For an abnormal cluster in which the registration time and the remaining storage space of the device are negatively correlated, the users in the cluster are classified by total storage space and device start-up time, so that users whose devices have the same total storage space and the same start-up time fall into one class. As shown in fig. 10, the users can be divided into 4 classes, which indicates that the fraudsters may have used 4 devices to register user accounts in batches.
In step 902, the correlation coefficient between the registration time and the remaining storage space of the device is calculated separately for each class of users. If the correlation coefficient falls within the specified range, the users contained in that class are illegal users, and the illegal users in the user set to be identified are thereby obtained.
Referring to the diagonal line segments shown in fig. 10, after the classification in step 901, the Pearson or Spearman correlation coefficient between the registration time and the remaining storage space of the device is calculated for each class of users, and it is checked whether the coefficient falls within the closed interval [-1, -0.9]. If it does, the users in that class are illegal users. In a daily production environment, about 60,000 to 80,000 new accounts are registered from iOS devices, and the method provided by the invention can identify about 20,000 to 30,000 of them as false accounts.
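Steps 901 and 902 can be sketched as follows, using the Pearson coefficient and the closed interval [-1, -0.9] mentioned above. The record layout (a dict per user with the keys `user_id`, `total_storage`, `boot_time`, `reg_time` and `free_storage`) and the function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return None
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, cov / sqrt(var_x * var_y)))


def flag_illegal_users(users, low=-1.0, high=-0.9):
    """Group users by device and flag groups whose coefficient is in [low, high].

    Step 901: users sharing total storage space and device start-up time are
    treated as coming from one device. Step 902: each group's correlation
    between registration time and remaining storage is checked.
    """
    groups = defaultdict(list)
    for user in users:
        groups[(user["total_storage"], user["boot_time"])].append(user)
    flagged = []
    for members in groups.values():
        r = pearson([u["reg_time"] for u in members],
                    [u["free_storage"] for u in members])
        if r is not None and low <= r <= high:
            flagged.extend(u["user_id"] for u in members)
    return sorted(flagged)
```

Groups with a single user, or with constant registration time or storage, have no defined coefficient and are left unflagged in this sketch; how such groups should be handled is not stated in the text.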
The following is an embodiment of the apparatus of the present disclosure, which may be used to perform the illegal user identification method embodiment based on big data analysis performed by the server 120 of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to an embodiment of an illegal user identification method based on big data analysis of the present disclosure.
Fig. 11 is a block diagram illustrating a big data analysis based illegal user recognition device that may be used in the server 120 of the implementation environment shown in fig. 1 to perform all or part of the steps of the big data analysis based illegal user recognition method shown in any one of fig. 3-7 and 9 according to an exemplary embodiment. As shown in fig. 11, the apparatus includes, but is not limited to: a data acquisition module 1110, a cluster number determination module 1130, a user clustering module 1150, and an abnormal cluster screening module 1170.
The data acquisition module 1110 is configured to acquire effective feature data of a user set to be identified and a legal user set.
The cluster number determining module 1130 is configured to cluster the effective feature data of the legal user set to determine a cluster number.
The user clustering module 1150 is configured to cluster the effective feature data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters.
The abnormal cluster screening module 1170 is configured to screen an abnormal cluster from the plurality of clusters, where an abnormal cluster is a cluster, among the plurality of clusters, in which the number of legal users is smaller than a preset threshold, and to confirm that the users in the user set to be identified who are clustered into the abnormal cluster are illegal users.
The implementation process of the functions and actions of each module in the device is specifically detailed in the implementation process of corresponding steps in the illegal user identification method based on big data analysis, and is not repeated here.
The data acquisition module 1110 may be, for example, a physical structure such as the wired or wireless network interface 250 of fig. 2.
The cluster number determining module 1130, the user clustering module 1150, and the abnormal cluster screening module 1170 may also be functional modules configured to perform corresponding steps in the illegal user identification method based on big data analysis. It is to be understood that these modules may be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, the modules may be implemented as one or more computer programs executing on one or more processors, such as the program stored in memory 232 executed by central processor 222 of fig. 2.
Optionally, the present disclosure further provides an electronic device, which may be used in the server 120 of the implementation environment shown in fig. 1, to perform all or part of the steps of the illegal user identification method based on big data analysis shown in any of fig. 3-7 and 9. The electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the illegal user identification method based on big data analysis according to the above-described exemplary embodiment.
The specific manner in which the processor of the electronic device performs the operations in this embodiment has been described in detail in relation to this embodiment of the big data analysis based illegal user identification method, and will not be described in detail here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, which may be, for example, a transitory or non-transitory computer-readable storage medium including instructions. The storage medium stores a computer program executable by the central processor 222 of the server 200 to perform the above-described illegal user identification method based on big data analysis.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (9)
1. An illegal user identification method based on big data analysis is characterized by comprising the following steps:
acquiring effective characteristic data of a user set to be identified and a legal user set;
clustering the effective characteristic data of the legal user set to determine the cluster number;
clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters;
screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster, among the plurality of clusters, in which the number of legal users is less than a preset threshold;
verifying whether the registration time of the user in the abnormal cluster and the residual storage space of the equipment show a negative correlation relationship or not;
and if the negative correlation relationship is presented, determining illegal users in the user set to be identified according to the users in the abnormal cluster.
2. The method of claim 1, wherein prior to the acquiring of effective characteristic data of the user set to be identified and the legal user set, the method further comprises:
acquiring service data of a user set to be identified and a legal user set;
and extracting effective characteristics of the service data of the user set to be identified and the legal user set to obtain the effective characteristic data of the user set to be identified and the legal user set.
3. The method according to claim 2, wherein the service data includes a plurality of feature variables, the extracting effective features of the service data of the to-be-identified user set and the legal user set to obtain effective feature data of the to-be-identified user set and the legal user set includes:
and removing, from the characteristic variables of the user set to be identified and the legal user set, the characteristic variables whose variable values are all the same, the remaining characteristic variables, whose variable values differ, forming the effective characteristic data.
4. The method according to claim 2, wherein the service data includes a plurality of feature variables, the effective feature extraction is performed on the service data of the to-be-identified user set and the legal user set to obtain the effective feature data of the to-be-identified user set and the legal user set, and the method further includes:
counting a first occurrence frequency of each variable value of the characteristic variable in a legal user set and a second occurrence frequency of each variable value in a user set to be identified;
if the difference between the first occurrence frequency and the second occurrence frequency is larger than a preset range, the characteristic variable belongs to effective characteristic data.
5. The method according to claim 2, wherein the service data includes a plurality of feature variables, the effective feature extraction is performed on the service data of the to-be-identified user set and the legal user set to obtain the effective feature data of the to-be-identified user set and the legal user set, and the method further includes:
estimating the predicted frequency of each variable value in the user set to be identified according to the frequency of each variable value in the legal user set;
counting the real frequency number of the variable value in the user set to be identified, and if the real frequency number is larger than a predicted frequency number and the real frequency number is larger than a first preset value and the predicted frequency number is smaller than a second preset value, the characteristic variable belongs to effective characteristic data; wherein the first preset value is greater than the second preset value.
6. The method according to claim 1, wherein the determining, according to the users in the abnormal cluster, illegal users in the set of users to be identified includes:
classifying the users with the same total storage space and the same equipment starting time into a class according to the total storage space and the equipment starting time of the users in the abnormal cluster;
and calculating the correlation coefficient of the registration time of each type of user and the residual storage space of the equipment respectively, and if the correlation coefficient meets the specified range, the users contained in the current category belong to illegal users, so that the illegal users in the user set to be identified are obtained.
7. An illegal user recognition device based on big data analysis, comprising:
the data acquisition module is used for acquiring effective characteristic data of the user set to be identified and the legal user set;
the cluster number determining module is used for clustering the effective characteristic data of the legal user set to determine the cluster number;
the user clustering module is used for clustering the effective characteristic data of the user set to be identified and the legal user set according to the cluster number to obtain a plurality of clusters;
the abnormal cluster screening module is used for screening abnormal clusters from the plurality of clusters, wherein an abnormal cluster is a cluster, among the plurality of clusters, in which the number of legal users is less than a preset threshold, and for confirming that the users in the user set to be identified who are clustered into the abnormal clusters are illegal users;
the abnormal cluster screening module comprises:
verifying whether the registration time of the user in the abnormal cluster and the residual storage space of the equipment show a negative correlation relationship or not;
and if the negative correlation relationship is presented, determining illegal users in the user set to be identified according to the users in the abnormal cluster.
8. An electronic device, the electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the big data analysis based illegal user identification method of any of claims 1-6.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program executable by a processor to perform the big data analysis based illegal user identification method according to any of the claims 1-6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811120248.6A CN109284380B (en) | 2018-09-25 | 2018-09-25 | Illegal user identification method and device based on big data analysis and electronic equipment |
PCT/CN2018/125248 WO2020062690A1 (en) | 2018-09-25 | 2018-12-29 | Method and apparatus for illegal user identification based on big data analysis, and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811120248.6A CN109284380B (en) | 2018-09-25 | 2018-09-25 | Illegal user identification method and device based on big data analysis and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109284380A CN109284380A (en) | 2019-01-29 |
CN109284380B (en) | 2023-04-25 |
Family
ID=65182106
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811120248.6A Active CN109284380B (en) | 2018-09-25 | 2018-09-25 | Illegal user identification method and device based on big data analysis and electronic equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109284380B (en) |
WO (1) | WO2020062690A1 (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111613049B (en) * | 2019-02-26 | 2022-07-12 | 北京嘀嘀无限科技发展有限公司 | Road state monitoring method and device |
CN109831454B (en) * | 2019-03-13 | 2022-02-25 | 北京深演智能科技股份有限公司 | False traffic identification method and device |
CN110348526B (en) * | 2019-07-15 | 2021-05-07 | 武汉绿色网络信息服务有限责任公司 | Equipment type identification method and device based on semi-supervised clustering algorithm |
CN111046388B (en) * | 2019-12-16 | 2022-09-13 | 北京智游网安科技有限公司 | Method for identifying third-party SDK in application, intelligent terminal and storage medium |
CN113190646B (en) * | 2020-01-14 | 2024-05-07 | 北京达佳互联信息技术有限公司 | User name sample labeling method and device, electronic equipment and storage medium |
CN111260220B (en) * | 2020-01-16 | 2021-05-14 | 北京房江湖科技有限公司 | Group control equipment identification method and device, electronic equipment and storage medium |
CN113472627B (en) * | 2020-03-31 | 2023-04-25 | 阿里巴巴集团控股有限公司 | E-mail processing method, device and equipment |
CN111506615A (en) * | 2020-04-22 | 2020-08-07 | 深圳前海微众银行股份有限公司 | Method and device for determining occupation degree of invalid user |
CN111626754B (en) * | 2020-05-28 | 2023-07-07 | 中国联合网络通信集团有限公司 | Card-keeping user identification method and device |
CN111814064A (en) * | 2020-06-24 | 2020-10-23 | 平安科技(深圳)有限公司 | Abnormal user processing method and device based on Neo4j, computer equipment and medium |
CN112529051B (en) * | 2020-11-25 | 2024-04-09 | 微梦创科网络科技(中国)有限公司 | Brush amount user identification method and device |
CN113114770B (en) * | 2021-04-14 | 2022-08-09 | 每日互动股份有限公司 | User identification method, electronic device, and computer-readable storage medium |
CN113222736A (en) * | 2021-05-24 | 2021-08-06 | 北京城市网邻信息技术有限公司 | Abnormal user detection method and device, electronic equipment and storage medium |
CN113779568A (en) * | 2021-09-18 | 2021-12-10 | 中国平安人寿保险股份有限公司 | Abnormal behavior user identification method, device, equipment and storage medium |
CN115408586B (en) * | 2022-08-25 | 2024-01-23 | 广东博成网络科技有限公司 | Intelligent channel operation data analysis method, system, equipment and storage medium |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150374A (en) * | 2013-03-11 | 2013-06-12 | 中国科学院信息工程研究所 | Method and system for identifying abnormal microblog users |
CN104917739A (en) * | 2014-03-14 | 2015-09-16 | 腾讯科技(北京)有限公司 | False account identification method and device |
CN105630885A (en) * | 2015-12-18 | 2016-06-01 | 国网福建省电力有限公司泉州供电公司 | Abnormal power consumption detection method and system |
US9367872B1 (en) * | 2014-12-22 | 2016-06-14 | Palantir Technologies Inc. | Systems and user interfaces for dynamic and interactive investigation of bad actor behavior based on automatic clustering of related data in various data structures |
CN106254153A (en) * | 2016-09-19 | 2016-12-21 | 腾讯科技(深圳)有限公司 | A kind of Network Abnormal monitoring method and apparatus |
CN106294508A (en) * | 2015-06-10 | 2017-01-04 | 深圳市腾讯计算机系统有限公司 | A kind of brush amount tool detection method and device |
CN106469276A (en) * | 2015-08-19 | 2017-03-01 | 阿里巴巴集团控股有限公司 | The kind identification method of data sample and device |
CN107465648A (en) * | 2016-06-06 | 2017-12-12 | 腾讯科技(深圳)有限公司 | The recognition methods of warping apparatus and device |
CN107517394A (en) * | 2017-09-01 | 2017-12-26 | 北京小米移动软件有限公司 | Identify the method, apparatus and computer-readable recording medium of disabled user |
CN108197958A (en) * | 2018-01-23 | 2018-06-22 | 北京小米移动软件有限公司 | Count the method, apparatus and storage medium of ox under line |
CN108269012A (en) * | 2018-01-12 | 2018-07-10 | 中国平安人寿保险股份有限公司 | Construction method, device, storage medium and the terminal of risk score model |
CN108540431A (en) * | 2017-03-03 | 2018-09-14 | 阿里巴巴集团控股有限公司 | The recognition methods of account type, device and system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9595006B2 (en) * | 2013-06-04 | 2017-03-14 | International Business Machines Corporation | Detecting electricity theft via meter tampering using statistical methods |
JP7057913B2 (en) * | 2016-06-09 | 2022-04-21 | 株式会社島津製作所 | Big data analysis method and mass spectrometry system using the analysis method |
CN108229963B (en) * | 2016-12-12 | 2021-07-30 | 创新先进技术有限公司 | Risk identification method and device for user operation behaviors |
Also Published As
Publication number | Publication date |
---|---|
CN109284380A (en) | 2019-01-29 |
WO2020062690A1 (en) | 2020-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109284380B (en) | Illegal user identification method and device based on big data analysis and electronic equipment | |
CN111159243B (en) | User type identification method, device, equipment and storage medium | |
JP2022518469A (en) | Information processing methods and devices, storage media | |
CN106960153B (en) | Virus type identification method and device | |
CN107222511B (en) | Malicious software detection method and device, computer device and readable storage medium | |
CN109325548B (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN103297267A (en) | Method and system for network behavior risk assessment | |
CN110995745B (en) | Method and device for separating and identifying illegal machine card of Internet of things | |
CN111542043B (en) | Method and device for identifying service request for changing mobile phone number | |
CN111064719B (en) | Method and device for detecting abnormal downloading behavior of file | |
CN111353138A (en) | Abnormal user identification method and device, electronic equipment and storage medium | |
CN109905524B (en) | Telephone number identification method and device, computer equipment and computer storage medium | |
CN108076032B (en) | Abnormal behavior user identification method and device | |
CN113727348B (en) | Method, device, system and storage medium for detecting user data of User Equipment (UE) | |
CN111371581A (en) | Method, device, equipment and medium for detecting business abnormity of Internet of things card | |
CN111178347B (en) | Ambiguity detection method, ambiguity detection device, ambiguity detection equipment and ambiguity detection storage medium for certificate image | |
CN109447177B (en) | Account clustering method and device and server | |
CN109951609B (en) | Malicious telephone number processing method and device | |
CN113051601A (en) | Sensitive data identification method, device, equipment and medium | |
CN110909263A (en) | Method and device for determining companion relationship of identity characteristics | |
Di Domenico et al. | Classification of heterogenous M2M/IoT traffic based on C-plane and U-plane data | |
CN114048344A (en) | Similar face searching method, device, equipment and readable storage medium | |
CN113901417A (en) | Mobile equipment fingerprint generation method and readable storage medium | |
CN112751813A (en) | Network intrusion detection method and device | |
CN110944290A (en) | Companion relationship analysis method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |