Detailed Description
Abnormal behavior as described above (e.g., "card maintenance" behavior) is profitable only when its cost is tightly controlled at scale. Long-term data analysis and sustained efforts against card maintenance have gradually revealed the characteristics of channel card-maintenance behavior in the local market, which are mainly reflected in month-end volume pushes by channels, concentrated user development, low activity, and the like. Accordingly, embodiments of the invention use three dimensions, namely "concentrated development, communication behavior, and low user quality", to subdivide the conditions for identifying abnormal users (such as card-maintenance users) into three levels: preconditions, requisite conditions, and additional conditions. This enables hierarchical and flexible identification, so that abnormal users with various behavior characteristics can be identified effectively, accurately, and exhaustively.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1A is a schematic view of a cloud service system according to an embodiment of the present invention. The system comprises a cloud server 101, a user terminal 102, an operator server 103, and a service site device 104. It is understood that the user terminal 102, the operator server 103, and the service site device 104 are illustrated only schematically, and an actual application scenario may include a large number of user terminals 102, operator servers 103, and service site devices 104. After the cloud server 101 establishes connections with the user terminal 102, the operator server 103, and the service site device 104 through various networks (e.g., a computer network or a mobile communication network), user information may be collected with the consent of the relevant user. For example, a prompt interface informs the user of the items of information listed in the interface that are to be collected, and the corresponding user information is collected only if the user agrees on the interface, so as to execute the method for identifying abnormal users in mobile communication of the present application.
The cloud server 101 may provide cloud services. Cloud services enable related services to be added, used, and delivered over a network, and provide dynamic, easily scalable, virtualized resources through the internet. A cloud computing environment hosted on the cloud server 101 may include components such as a cloud management function, a software as a service (SaaS) layer, a platform as a service (PaaS) layer, and an infrastructure as a service (IaaS) layer. These components work in concert so that the cloud computing environment can provide various services via a cloud or networked environment, for example, the results of a method of abnormal user identification (i.e., a set of identified abnormal users) in accordance with an embodiment of the present invention.
The cloud management functionality may provide integrated management of various cloud services including, but not limited to, SaaS, PaaS, IaaS. For example, cloud management functions may include provisioning, managing, and tracking various cloud services to which a user subscribes.
The SaaS layer may provide software-level cloud services, for example, by directly interacting with users through a user interface. For example, the SaaS layer may provide the ability to build and deliver a suite of on-demand applications on an integrated development and deployment platform. The cloud services provided by the SaaS layer facilitate a user obtaining a set of identified anomalous users using an application executing on the cloud computing environment, e.g., using the methods illustrated in fig. 1B-4C and/or the apparatus illustrated in fig. 6.
The PaaS layer may, for example, provide a solution for platform-level development and distribution of applications. For example, the PaaS layer may provide a distributed operating system for users with development requirements, so as to implement collaborative development of users and flexible extension of resources. The services provided by the PaaS layer facilitate users in leveraging programming languages and tools supported by the cloud computing environment and controlling deployed services, enabling improvements or training, for example, for the various features of the methods shown in fig. 1B-4C and/or the apparatus shown in fig. 6.
The IaaS layer may, for example, provide various underlying hardware resources at the infrastructure level. For example, the IaaS layer may provide servers, storage, and other network hardware, saving users hardware maintenance costs and easing office space constraints. The IaaS layer facilitates management and control of hardware resources by users and may be used, for example, to implement the smart device described herein on hardware.
The methods shown in fig. 1B-4C and/or the apparatus shown in fig. 6 and/or the smart device shown in fig. 7 may be implemented, alone or in combination, on one or more of the SaaS, PaaS, IaaS layers described above, and may be implemented in hardware, software, or a combination of software and hardware, according to embodiments of the invention. In other words, methods, apparatuses, devices, and/or computer-readable storage media and the like according to aspects of embodiments of the present invention may be implemented in a cloud computing environment, for example, a person skilled in the art may implement the respective features individually or in combination as needed by arranging the respective features at respective hierarchies. For example, a user may subscribe to one or more services provided by the cloud computing environment. The cloud computing environment may then perform processing to provide one or more services subscribed to by the user, such as providing an abnormal set of users identified by implementing the methods illustrated in fig. 1B-4C and/or the apparatus illustrated in fig. 6, and/or related development functionality.
FIG. 1B is a flow diagram illustrating an example method 100 for anomalous user identification in accordance with an embodiment of the present invention.
According to an embodiment of the invention, abnormal users generally comprise users with abnormal communication behavior, such as card-maintenance users and low-quality customers. The embodiments below take a card-maintenance user as an example of an abnormal user, but the present invention is not limited thereto. A card-maintenance user generally refers to the user corresponding to a SIM card used for card maintenance, and each card-maintenance user corresponds to one user identification, such as the identification of the SIM card. Typically, the data associated with card maintenance includes, but is not limited to, the user's call duration, traffic size, number of short messages, billing revenue, and the like. Such data is usually recorded by the operator for daily operations and can be obtained directly from the operator-side operational data (for example, from a server set up by the operator) without intruding on the user's personal privacy (as could happen, for example, if a user's personal data were analyzed manually).
As shown in fig. 1B, the example method 100 for anomalous user identification may be performed by an intelligent device, which may be, for example, a dedicated server set up in the background by an operator. The method comprises the following steps:
Step S101: a first set of users is obtained.
The first set of users may include a plurality of user identifications (e.g., identifications of SIM cards). Each user identification may be mapped to the operational data of the corresponding user in the operational data from the operator, e.g., for analysis related to abnormal user identification. The first set of users may be any set of user identifications to be screened that may contain user identifications of abnormal users (e.g., card-maintenance users), i.e., a set of suspected abnormal users. For example, the first set of users may be obtained from an operator or another data source, either according to some predetermined condition or arbitrarily, which is not limited by the present invention.
Step S102: activity-associated data of the user indicated by each user identification in the first set of users is obtained.
As described above, the user identification may likewise be mapped to the activity-associated data of the corresponding user (e.g., stored in a server of the operator), so the activity-associated data of each user in the first set of users may be obtained from it. Generally, the activity-associated data includes the call duration (e.g., in minutes), the traffic size (e.g., in MB), and the number of short messages of the user indicated by the user identification, as well as any other data that reflects how frequently the user communicates (e.g., using the corresponding SIM card) within a limited time, which is not limited by the present invention.
Step S103: a second set of users is determined from the first set of users based on the activity-associated data.
The frequency of a user's communication behavior (such as short messages and calls) can be determined from the activity-associated data, and whether the user may be an abnormal user (e.g., a card-maintenance user) is judged from that frequency. For example, the higher the frequency of use (i.e., the higher the activity), the more likely the user is a normal user; the lower the frequency of use (i.e., the lower the activity), the more likely the user is an abnormal user. To this end, a separate activity threshold may be set for each type of activity-associated data, or a single overall activity threshold may be set for multiple types of activity-associated data, as described in detail below. The activity of the user indicated by each user identification in the first set of users can thus be measured against the activity-associated data and the corresponding threshold, and the user identifications of the one or more users that satisfy the threshold condition (for example, activity below a certain activity threshold) are screened out of the first set of users to form the second set of users (i.e., a narrowed set of suspected abnormal users).
According to an embodiment of the present invention, the determination (or screening) based on the activity-associated data can serve as a requisite condition for identifying abnormal users (e.g., card-maintenance users). Abnormal behavior (e.g., card maintenance) generally keeps activity low to save cost; once activity is sufficiently high, the behavior is no longer profitable.
Step S104: behavior feature data of the user indicated by each user identification in the second set of users is obtained.
As described above, the user identification may likewise be mapped to the behavior feature data of the corresponding user (e.g., stored in a server of the operator), so the behavior feature data of each user in the second set of users may be obtained from it. Generally, in addition to the call duration, traffic size, and number of short messages described above, the behavior feature data includes the billing revenue (e.g., in yuan) of the user indicated by the user identification, contact numbers, user identity (e.g., the certificate number used for user registration, the IMEI, etc.), accessed base stations, activation time, off-network time, and any other data capable of characterizing the communication behavior of the user, which is not limited by the present invention. The behavior feature data and the activity-associated data may overlap, for example in the call duration, traffic size, and number of short messages.
Step S105: a third set of users is determined from the second set of users based on the behavior feature data.
The communication behavior features of the user indicated by a user identification can be analyzed (e.g., classified) from the behavior feature data, and whether the user is likely to be an abnormal user (e.g., a card-maintenance user) can be judged from those features. The behavior feature classifications may include contact number concentration, short power-on time, account-opening identity concentration, access base station concentration, IMEI concentration, short-term off-network, and any other behavior feature capable of indicating that one or more users may be abnormal users, which is not limited by the present invention.
For example, the behavior feature "IMEI concentration" may mean that a number of users greater than or equal to a certain threshold (e.g., 5) activate service or place calls from a terminal with the same IMEI. The behavior feature "short-term off-network" may mean that a newly developed user leaves the network within a certain time threshold (e.g., three months, counted from and including the month of network entry) or no longer generates a bill after the time threshold.
After the behavior feature data is analyzed, the user identification of a single user satisfying a certain classification (e.g., short-term off-network) may be recorded into the third set of users, or the user identifications of a plurality of users jointly satisfying a certain classification (e.g., IMEI concentration) may be recorded into the third set of users. For example, if the behavior feature data shows that a user left the network within three months or was not billed after three months, the user identification of that user may be recorded into the third set of users. By contrast, the user identifications of a plurality of users satisfying a group classification (e.g., IMEI concentration) may be recorded into the third set of users only when the number of those users is greater than or equal to a certain threshold (e.g., 5). If the behavior feature data shows that multiple users (especially more than a certain threshold) activate terminals or place calls with the same IMEI, this likely indicates card maintenance using a modem pool (a so-called "cat pool"). Further behavior feature data and the associated analysis are described below.
Therefore, based on the behavior feature data and the corresponding classifications, the behavior of the users indicated by the user identifications in the second set of users can be classified. A single user identification satisfying a certain classification condition, or a plurality of user identifications (numbering at least a certain threshold) jointly satisfying a certain classification condition, is screened out of the second set of users and recorded into the third set of users, which serves as a further narrowed set of suspected abnormal users (e.g., card-maintenance users).
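The two example classifications described above, one applied to a group of users and one applied to a single user, can be sketched in Python as follows. This is only an illustrative sketch under assumptions: the record layout (`user_id`, `imei`), the parameter names, and the function names are hypothetical, and the thresholds (5 users, three months) are the example values given in the text.

```python
from collections import defaultdict


def imei_concentrated_users(records, min_users=5):
    """Return the user IDs that share one IMEI with at least `min_users` users.

    `records` is a list of dicts with hypothetical keys "user_id" and "imei".
    """
    users_by_imei = defaultdict(set)
    for rec in records:
        users_by_imei[rec["imei"]].add(rec["user_id"])
    flagged = set()
    for users in users_by_imei.values():
        # Group rule: only record the users when the group is large enough.
        if len(users) >= min_users:
            flagged |= users
    return flagged


def is_short_term_off_network(months_until_offline, billed_after_threshold,
                              max_months=3):
    """Single-user rule: left the network early, or no bill after the threshold.

    `months_until_offline` is None for a user who is still on the network.
    """
    left_early = (months_until_offline is not None
                  and months_until_offline <= max_months)
    return left_early or not billed_after_threshold
```

The group rule returns nothing for an IMEI shared by fewer users than the threshold, which mirrors the statement that such user identifications are recorded only once the group is large enough.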
According to embodiments of the present invention, the determination (or screening) based on the behavior feature data can serve as an additional condition for identifying abnormal users (e.g., card-maintenance users): once the requisite condition is determined to be satisfied, the resulting, still relatively broad, set of users is screened further in order to locate one or more abnormal users more accurately. A user who exhibits both low activity and a certain abnormal behavior feature is more likely to be an abnormal user, as is a group of users who exhibit low activity together with a concentrated abnormal behavior feature. Such two-stage screening, activity analysis followed by behavior feature analysis, can therefore identify abnormal users more efficiently than performing multiple screenings separately and in no particular order, and can reduce the amount of data to be processed.
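The two-stage screening of steps S103 and S105 can be illustrated with the following minimal Python sketch. The function name and the two predicate parameters (`is_low_activity`, `has_abnormal_behavior`) are hypothetical placeholders for whatever activity and behavior feature checks an implementation actually uses; the sketch only shows the ordering of the stages.

```python
def identify_abnormal_users(first_set, is_low_activity, has_abnormal_behavior):
    """Apply the requisite condition first, then the additional condition.

    Both predicates take a user identification and return True or False.
    """
    # Stage 1 (requisite condition, step S103): keep only low-activity users.
    second_set = [uid for uid in first_set if is_low_activity(uid)]
    # Stage 2 (additional condition, step S105): the behavior feature check
    # runs only on users that passed stage 1.
    third_set = [uid for uid in second_set if has_abnormal_behavior(uid)]
    return second_set, third_set
```

Because the second predicate is evaluated only for members of the second set, the typically more expensive behavior feature analysis touches fewer users, which is one way to read the reduction in processed data mentioned above.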
Figs. 2A-2B are flowcharts illustrating example methods for determining a first set of users according to an embodiment of the present invention, in which the first set of users may be obtained based on the numbers of newly added users at a plurality of target sites.
As shown in fig. 2A, an example method 200 for determining a first set of users may implement step S101 (i.e., obtain the first set of users) as shown in fig. 1B and may include the steps of:
Step S201: operational data for a plurality of target sites is obtained from an operator database. The operational data may include the number of newly added users per day at each of the plurality of target sites over a predetermined time period.
For example, the predetermined period of time may typically be a certain month or a specified number of days.
Step S202: for each target site, whether the target site satisfies a predetermined condition is determined based on the number of newly added users per day.
The predetermined condition here may be the predetermined condition mentioned above in connection with step S101. For example, the predetermined condition may be used to determine the users of one of a plurality of sites within a channel as a set of suspected abnormal users (e.g., card-maintenance users); the channel may be a newly developed channel or another channel that needs to be checked for abnormal users. Specific examples of the predetermined condition are described below.
Step S203: if the target site is determined to satisfy the predetermined condition, the plurality of user identifications corresponding to the target site are determined as the first set of users. Otherwise, the method may return to step S202 and perform step S202 for the next target site.
As shown in fig. 2B, an example method 200' for determining a first set of users may include the steps of:
Steps S201 and S203 are the same as in fig. 2A, and their description is omitted here.
Steps S2021 to S2024 may implement step S202 shown in fig. 2A, and specifically include:
Step S2021: based on the number of newly added users per day at the target site in the predetermined time period, a first daily average number of newly added users in a first sub-period of the predetermined time period and a second daily average number of newly added users in a second sub-period of the predetermined time period are calculated, and the ratio of the first daily average to the second daily average is compared with a first threshold.
For example, the predetermined time period may be a specified month (e.g., any one of months 1-12 of a specified year), the first sub-period may be the last few days (e.g., the last five days) of the specified month, and the second sub-period may be the remaining days of the specified month. In one example, the first sub-period is the last five days of a 31-day month, the second sub-period is the first twenty-six days of that month, and the number of newly added users on each day is denoted $X_i$, where $X$ denotes the number of newly added users and the subscript $i$ denotes the day of the month; for example, the number of newly added users on the sixth day of the month is denoted $X_6$. In this example, the daily average of newly added users over the last five days (i.e., the first daily average) is calculated as
$$\bar{X}_{\text{first}} = \frac{1}{5}\sum_{i=27}^{31} X_i \quad (i = 27, 28, \ldots, 31);$$
the daily average of newly added users over the first twenty-six days (i.e., the second daily average) is calculated as
$$\bar{X}_{\text{second}} = \frac{1}{26}\sum_{i=1}^{26} X_i \quad (i = 1, 2, \ldots, 26).$$
Then, the ratio of the first daily average to the second daily average is calculated as
$$R = \frac{\bar{X}_{\text{first}}}{\bar{X}_{\text{second}}}.$$
In this case, the ratio $R$ may indicate whether the site's user development at the end of the month exceeds its average daily user development over the rest of the month to a "suspicious" degree, since the development of card-maintenance users tends to occur at month end in order to boost performance in the month-end "rush". For example, the first threshold may be set to 1.4, indicating an excess of 40%. Alternatively, the ratio may be replaced by the difference between the first daily average and the second daily average, which is not limited by the present invention.
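As a worked illustration of step S2021, the following Python sketch computes the month-end ratio, assuming that the two quantities being compared are the daily averages over the last few days and over the remaining days of the period. The function name, the parameter names, and the handling of a zero denominator are assumptions made for the example, not part of the described method.

```python
def month_end_ratio(daily_new_users, tail_days=5):
    """Ratio of the daily average over the last `tail_days` days to the
    daily average over the earlier days of the period.

    `daily_new_users[i]` is the number of newly added users on day i + 1.
    """
    if len(daily_new_users) <= tail_days:
        raise ValueError("the period must be longer than the tail sub-period")
    head = daily_new_users[:-tail_days]
    tail = daily_new_users[-tail_days:]
    head_avg = sum(head) / len(head)
    tail_avg = sum(tail) / len(tail)
    if head_avg == 0:
        # Assumption: no development earlier in the period makes any
        # month-end development infinitely "concentrated".
        return float("inf") if tail_avg > 0 else 0.0
    return tail_avg / head_avg
```

For a 31-day month with 10 new users on each of the first 26 days and 15 on each of the last 5 days, the ratio is 1.5, which meets the example first threshold of 1.4.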
Step S2022: the cumulative number of newly added users at the target site in a third sub-period of the predetermined time period is calculated and compared with a second threshold.
For example, the third sub-period may be any three consecutive days selected within the predetermined time period. In one example, the third sub-period runs from the twentieth day to the twenty-second day of a month, and the cumulative number of newly added users in the third sub-period is calculated as $N_{20\sim22} = \sum_{i=20}^{22} X_i$, where $N$ denotes the cumulative number of newly added users and the subscript $20\sim22$ indicates that the consecutive days run from the twentieth to the twenty-second day of the month. This value may indicate whether the number of users developed in a concentrated manner over the three consecutive days (i.e., the cumulative number of newly added users) is significantly above the normal level. For example, the second threshold may be set to 50, 100, or another number, which is not limited by the present invention.
As shown, steps S2021 and S2022 may be performed in parallel, and accordingly the subsequent steps S2023 and S2024 may also be performed in parallel. Specifically:
Step S2023: whether the calculated ratio is greater than or equal to the first threshold is determined.
Step S2024: whether the cumulative number of newly added users is greater than or equal to the second threshold is determined.
As shown, the arrows from steps S2023 and S2024 converge on step S203, which means that the two determinations are in a logical OR relationship. That is, if the results of steps S2023 and S2024 indicate that at least one of the following holds, namely that the ratio is greater than or equal to the first threshold or that the cumulative number of newly added users is greater than or equal to the second threshold, the target site is determined to satisfy the predetermined condition. In other words, the target site is determined to be a site suspected of having abnormal users (e.g., card-maintenance users) that warrants further screening, and its users are determined as the first set of users.
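The logical OR of the two determinations can be sketched as a single predicate, shown below in Python. This is a hedged illustration: the function and parameter names are hypothetical, the default thresholds (1.4 and 50) are the example values from the text, the ratio is assumed to compare daily averages, and `window_start_day` is a 1-based day of the month chosen by the caller.

```python
def site_meets_precondition(daily_new_users, window_start_day,
                            ratio_threshold=1.4, cumulative_threshold=50,
                            tail_days=5, window_days=3):
    """Return True if either check (steps S2023 or S2024) flags the site."""
    if len(daily_new_users) <= tail_days:
        raise ValueError("the period must be longer than the tail sub-period")
    head = daily_new_users[:-tail_days]
    tail = daily_new_users[-tail_days:]
    head_avg = sum(head) / len(head)
    tail_avg = sum(tail) / len(tail)
    # Step S2023: month-end daily average versus the rest of the period.
    # Assumption: with no earlier development, any month-end development hits.
    ratio_hit = tail_avg > 0 and (
        head_avg == 0 or tail_avg / head_avg >= ratio_threshold)
    # Step S2024: cumulative new users over the consecutive-day window.
    start = window_start_day - 1
    cumulative = sum(daily_new_users[start:start + window_days])
    cumulative_hit = cumulative >= cumulative_threshold
    # The two determinations are combined with a logical OR.
    return ratio_hit or cumulative_hit
```

A site is therefore selected when it shows either a month-end surge or a burst of development within the chosen three-day window; it does not need to show both.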
Here, methods 200 and 200' may be used to select, from a plurality of sites, a site whose users form the set of users to be screened. Further, methods 200 and 200' may serve as the "precondition" which, in combination with the "requisite condition" and the "additional condition", implements hierarchical and flexible identification of abnormal users (e.g., card-maintenance users) and helps improve identification efficiency.
FIG. 3 is a flow diagram illustrating an example method 300 for determining a second set of users in accordance with an embodiment of the present invention.
As shown in fig. 3, the example method 300 for determining the second set of users may implement step S103 shown in fig. 1B (i.e., determining the second set of users from the first set of users according to the activity-associated data). In an embodiment according to the present invention, the activity-associated data includes at least the call duration, the traffic size, and the number of short messages of the user indicated by the user identification, and the method 300 may include the following steps after step S102 described above:
S1031: for each user identification in the first set of users, the weighted sum of the call duration, the traffic size, and the number of short messages of the user indicated by the user identification is calculated.
For example, the call duration, the traffic size, and the number of short messages of the user indicated by the selected user identification may be data for a certain time period (e.g., a certain month, such as January). Since the units of the respective indexes usually differ, the indexes may be normalized before the weighted sum is calculated. Further, the weight of each index may default to 1 or may be set differently as needed. In this example, the weighted sum so calculated indicates how active the user was during the month: the higher the weighted sum, the higher the activity, and the lower the weighted sum, the lower the activity.
S1032: the calculated weighted sum is compared to a third threshold, e.g., to determine if the weighted sum is less than or equal to the third threshold.
The third threshold may be any weighted sum threshold that can reflect the activity level, and is shown here as an example term, which is not intended to limit the invention.
S1033: recording the user identification to a second set of users if the calculated weighted sum is less than or equal to the third threshold. Otherwise, it may return to step S1031 to perform steps S1031 to S1032 for the next user identifier.
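Steps S1031 to S1033 can be sketched in Python as follows. The text does not specify how the indexes are normalized, so this sketch assumes min-max normalization across the first set of users; the function name, the data layout, and the example threshold of 0.5 are likewise hypothetical.

```python
def low_activity_users(activity, weights=(1.0, 1.0, 1.0), threshold=0.5):
    """Return the user IDs whose weighted activity sum is <= `threshold`.

    `activity` maps user ID -> (call_minutes, traffic_mb, sms_count).
    Each index is min-max normalized to [0, 1] before weighting (assumption).
    """
    columns = list(zip(*activity.values()))
    lows = [min(col) for col in columns]
    # A constant column has zero range; fall back to 1.0 to avoid dividing by 0.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    second_set = []
    for uid, values in activity.items():
        # S1031: weighted sum of the normalized indexes.
        score = sum(w * (v - lo) / span
                    for w, v, lo, span in zip(weights, values, lows, spans))
        # S1032 and S1033: compare with the threshold and record the user.
        if score <= threshold:
            second_set.append(uid)
    return second_set
```

With all weights at the default value of 1, the score ranges from 0 to 3, so a threshold such as 0.5 selects only users who rank near the bottom on all three indexes.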
Optionally, the method 300 may further include the following steps after the step S1033 as described above:
S1034: the call duration, the traffic size, and the number of short messages of the user indicated by each user identification in the second set of users are compared with a low call duration threshold, a low traffic size threshold, and a low short message number threshold, respectively, for example to determine whether each is less than or equal to its corresponding threshold. If so, in step S1035 the user identification whose call duration is less than or equal to the low call duration threshold, whose traffic size is less than or equal to the low traffic size threshold, and whose number of short messages is less than or equal to the low short message number threshold is taken as a target user identification and deleted from the second set of users. If not, step S1034 is repeated for the next user identification in the second set of users.
In this case, the indexes need not be normalized. For example, the low call duration threshold may be set to 5 minutes, the low traffic size threshold may be set to 3 MB, and the low short message number threshold may be set to 4. When the call duration, the traffic size, and the number of short messages are each less than or equal to the corresponding threshold, the user may be determined to be an "extremely low usage" user. When the call duration, the traffic size, and the number of short messages are all zero, the user may be determined to be a "three-no" user (i.e., no calls, no traffic, no short messages).
S1035: as described above, each user identification whose call duration, traffic size, and number of short messages simultaneously satisfy the threshold conditions (e.g., each is less than or equal to the corresponding threshold) is determined to be a target user identification and is deleted from the second set of users. That is, the target user identifications in the second set of users are deleted.
Through steps S1034 to S1035, the user identifications corresponding to users with markedly low activity (i.e., markedly low-quality users) can be filtered out of the second set of users. Whether or not such low-quality users are card-maintenance users, it may not be worthwhile for the operator to continue serving them. For example, the communication behavior of such low-quality users ("three-no" and "extremely low usage" users) may subsequently be analyzed over a duration exceeding a certain time (e.g., three months) to further determine whether service to them should be discontinued.
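The optional filtering of steps S1034 to S1035 can be sketched as follows. The function name and the data layout are hypothetical, and the default thresholds (5 minutes, 3 MB, 4 short messages) are the example values from the text; raw, unnormalized values are compared directly.

```python
def split_low_quality(second_set, usage, max_call_min=5, max_traffic_mb=3,
                      max_sms=4):
    """Remove markedly low-quality users from the second set.

    `usage` maps user ID -> (call_minutes, traffic_mb, sms_count).
    Returns (kept, three_no, extremely_low) as three lists of user IDs.
    """
    kept, three_no, extremely_low = [], [], []
    for uid in second_set:
        call, traffic, sms = usage[uid]
        if call == 0 and traffic == 0 and sms == 0:
            # No calls, no traffic, no short messages.
            three_no.append(uid)
        elif (call <= max_call_min and traffic <= max_traffic_mb
              and sms <= max_sms):
            # All three indexes are at or below their low thresholds.
            extremely_low.append(uid)
        else:
            kept.append(uid)
    return kept, three_no, extremely_low
```

Returning the removed identifications in separate lists, rather than discarding them, is a design choice of this sketch that would allow the later long-term analysis of such users mentioned above.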
In one embodiment, S1034 may be performed before S1031 to delete the user identifications corresponding to "three-no" and "extremely low usage" users from the second set of users.
Fig. 4A-4C are flowcharts illustrating an example method for determining a third set of users, according to an embodiment of the present invention.
In an embodiment according to the invention, the behavior feature data may comprise one or more of the billing revenue, the number of short messages, the call duration, the traffic size, the contact numbers, and the user identity of the user indicated by the user identification. In addition to IMEI concentration and short-term off-network as described above, the behavior feature classifications may also include contact number concentration, short power-on time, account-opening identity concentration, access base station concentration, and the like.
For example, contact number concentration may mean that the number of contact numbers (calling plus called) of a user within a specified time (e.g., a specified month) is less than or equal to a certain threshold (e.g., 3), where the contact numbers typically exclude customer service numbers such as 10000.
For example, short power-on time may mean that the number of days on which a user was powered on within a specified time (e.g., a specified month) is less than or equal to a certain threshold (e.g., 3 days). Power-on time is counted by day: a calendar day on which the terminal is powered on for longer than a certain threshold (e.g., 2 hours) counts as a power-on day.
For example, account-opening identity concentration may mean that more than a certain threshold number (e.g., 3) of users opened accounts with the same certificate within a specified time (e.g., a specified month).
For example, access base station concentration may mean that the number of base stations accessed by a user within a specified time (e.g., a specified month) is less than or equal to a certain threshold (e.g., 3).
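The four classifications just described can be expressed as simple predicates, sketched below in Python. All function names, parameter names, and input layouts are hypothetical; the thresholds are the example values from the text, and the inputs are assumed to cover the specified time (e.g., one month).

```python
from collections import Counter

# Customer service numbers excluded from contact counting (example: 10000).
CUSTOMER_SERVICE_NUMBERS = {"10000"}


def contact_number_concentrated(contacts, max_contacts=3):
    """True if the user's distinct calling/called numbers are few."""
    distinct = set(contacts) - CUSTOMER_SERVICE_NUMBERS
    return len(distinct) <= max_contacts


def short_power_on_time(hours_on_per_day, min_hours=2, max_days=3):
    """True if few days count as power-on days (more than `min_hours` on)."""
    days_on = sum(1 for hours in hours_on_per_day if hours > min_hours)
    return days_on <= max_days


def base_station_concentrated(stations, max_stations=3):
    """True if the user accessed few distinct base stations."""
    return len(set(stations)) <= max_stations


def identity_concentrated_users(credential_by_user, max_per_credential=3):
    """Return user IDs whose credential opened more than the allowed accounts."""
    counts = Counter(credential_by_user.values())
    return {uid for uid, cred in credential_by_user.items()
            if counts[cred] > max_per_credential}
```

The first three predicates evaluate one user at a time, while the last one operates on a group of users, which matches the distinction between single-user and concentrated features.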
As shown in fig. 4A, an example method 400 for determining a third set of users may implement step S105 shown in fig. 1B (i.e., determining a third set of users from the second set of users according to behavior feature data), and the method 400 may include the steps of:
Step S401: behavior feature data of the user indicated by each user identification in the second set of users is analyzed.
As mentioned above, the analysis may be to analyze the communication behavior characteristics of the user indicated by the user identifier (e.g., classify the communication behavior characteristics of the user) to determine one or more user identifiers satisfying a certain abnormal behavior characteristic (i.e., a certain classified communication behavior characteristic) as a result of the analysis.
Step S402: a third set of users is determined from the second set of users based on results of the analysis.
As previously mentioned, in step S402 the one or more user identifications determined for a particular abnormal behavior characteristic may be included in the third set of users as a set of users suspected to be abnormal (e.g., card-maintenance users). In another example, the plurality of user identifications may be included in the third set of users only when the number of user identifications determined for a particular abnormal behavior feature is greater than or equal to a threshold (e.g., when a particular card-maintenance behavior feature appears in a concentrated manner). Here, including in the third set of users means recording the determined user identifications into the third set of users.
In more specific embodiments, steps S401 and S402 may have different implementation methods.
As shown in fig. 4B, steps S401 and S402 may be implemented with the example method 400'. The method 400' includes the steps of:
Steps S4011 to S4013 may implement step S401 as shown in fig. 4A.
Step S4011: A behavioral characteristic score is calculated for the user indicated by each user identification in the second set of users.
According to embodiments of the invention, the behavioral characteristic score may be calculated based on the behavioral characteristic data and used for evaluation, analysis, classification, or the like of the behavioral characteristic. For example, the behavior feature score S may be a normalized representation of the behavior feature data x, such as S = (x - μ)/σ, where μ is the mean of the behavior feature data (e.g., over a particular time period such as a month) and σ is the standard deviation of the behavior feature data. However, the invention does not limit the method for calculating the behavior feature score, and other standardized methods for measuring the behavior feature data are also possible.
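As a minimal sketch of this normalization, the following Python function computes a score for each user's raw value of one behavior feature. It assumes that σ denotes the standard deviation (as in the Step 1 example later in this description) and that the population standard deviation is used; the function name and the handling of the zero-deviation case are assumptions made for illustration.

```python
from statistics import mean, pstdev


def behavior_scores(values):
    """Return (x - mean) / standard deviation for each raw value in `values`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All users share the same value; score everyone as 0 by convention here.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]
```

Scores computed this way are unitless, so features measured on different scales (fees, call counts, traffic) can be compared or combined in the later similarity step.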
Step S4012: The behavior feature similarity between every two users indicated by the user identifications is derived from the calculated behavior feature scores of the user indicated by each user identification.
According to the embodiment of the invention, the behavior feature similarity is calculated from the behavior feature scores of every two users and represents the degree of behavioral similarity between the two users. In one example, the behavior feature similarity may be described directly by a similarity distance (e.g., a Euclidean distance). In this case, the distance d between user a and user b may be expressed as:
$$d(a, b) = \sqrt{\sum_{j} \left(S_{a,j} - S_{b,j}\right)^{2}}$$
where $S_{a,j}$ and $S_{b,j}$ denote the behavior feature scores of type j of user a and user b, and j is an index indicating the type of the behavior feature score. For example, j = 1 indicates that the behavior feature score is a charge-out revenue score, and j = 2 indicates that the behavior feature score is a short message score; the indices can be set as needed. In this case, the lower the calculated behavior feature similarity A (i.e., the distance characterizing the similarity), the higher the similarity of the selected behavior features between the two users. In another example, the similarity of the selected behavior features between two users may also be represented by the difference between 1 and the normalized similarity distance value. In this case, the value reflects the similarity more directly: the higher the calculated behavior feature similarity A (i.e., 1 minus the distance value), the higher the similarity of the selected behavior features between the two users.
Step S4013: For a selected user identification in the second set of users, the user identifications of other users whose behavior feature similarity to the user indicated by the selected user identification is within the behavior feature similarity threshold are taken as the associated user identifications of the selected user identification.
According to an embodiment of the present invention, for a selected user identification, step S4013 calculates the behavior feature similarity for a certain behavior feature data/score between the selected user identification and the other user identifications in the same user set. For example, for the short message score of user identification a (e.g., represented by j = 2), the short message scores of the other user identifications b to z in the second set of users may be obtained, and the short message score similarities $A_2(a, b)$, $A_2(a, c)$, ..., $A_2(a, z)$ may be calculated one by one (e.g., as 1 minus the normalized similarity distance value). Then, if the behavior feature similarity threshold for the short message score is set to 0.97, the user identifications whose similarity among $A_2(a, b)$ to $A_2(a, z)$ is greater than or equal to 0.97 are the associated user identifications to be recorded. For example, if $A_2(a, c)$, $A_2(a, f)$, $A_2(a, g)$, $A_2(a, m)$, $A_2(a, n)$, and $A_2(a, r)$ are greater than or equal to 0.97, the user identifications c, f, g, m, n, and r are recorded as the associated user identifications. According to embodiments of the present invention, multiple kinds of behavior feature data/scores may also be selected to calculate an overall (e.g., weighted) similarity.
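Steps S4012 and S4013 can be illustrated with a short Python sketch. It computes the similarity as 1 minus a normalized Euclidean distance and returns the identifications at or above the threshold. The text does not specify how the distance is normalized; dividing by the largest pairwise distance in the set is an assumption made here, as are the function names and the dictionary layout (user identification mapped to a list of scores).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two score vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def associated_ids(selected, scores, threshold=0.97):
    """Return the IDs whose similarity to `selected` is >= threshold.

    Similarity = 1 - distance / (largest pairwise distance in the set).
    This normalization is an assumption; the text leaves it open.
    """
    ids = list(scores)
    max_d = max(
        (euclidean(scores[p], scores[q]) for p in ids for q in ids if p != q),
        default=0.0,
    )
    result = []
    for other in ids:
        if other == selected:
            continue
        d = euclidean(scores[selected], scores[other])
        similarity = 1.0 if max_d == 0 else 1.0 - d / max_d
        if similarity >= threshold:
            result.append(other)
    return result
```

Passing one-element score vectors corresponds to comparing a single behavior feature (such as the short message score); longer vectors give an unweighted overall similarity.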
Steps S4021 to S4023 may implement step S402 as shown in fig. 4A.
Step S4021: the number of associated user identities is calculated.
In the above example, for the short message score, the associated user identifications are c, f, g, m, n, and r, so their number is 6.
Step S4022: the sum of the calculated number of associated user identifications plus one is compared to a predetermined number threshold, e.g., to determine whether the number of associated user identifications plus one is greater than or equal to the number threshold.
Step S4023: if it is determined that the sum of the calculated number of associated user identities plus one is greater than or equal to the predetermined number threshold, the selected user identity is recorded together with its associated user identity in a third set of users.
In the case where the predetermined number threshold is 5, the number of associated user identifications counted above plus one, i.e., 6 + 1 = 7, is compared with 5. Since 7 > 5, the user identifications a, c, f, g, m, n, and r are recorded into the third set of users.
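The counting and recording logic of steps S4021 to S4023 can be sketched as follows. The function name and the use of a Python set for the third set of users are illustrative assumptions; the default threshold of 5 is the example value from the text.

```python
def update_third_set(selected, associated, third_set, number_threshold=5):
    """Record the selected ID and its associated IDs if the group is large enough.

    Returns True when the group was recorded into `third_set`.
    """
    group_size = len(associated) + 1  # associated users plus the selected user
    if group_size >= number_threshold:
        third_set.add(selected)
        third_set.update(associated)
        return True
    return False
```

With the example above, `update_third_set("a", ["c", "f", "g", "m", "n", "r"], third_set)` compares 7 with 5 and records all seven identifications.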
Here, an example similarity algorithm for identifying associated card-maintenance users is additionally given for implementing steps S4012 to S4013 and steps S4022 to S4023: similarity is evaluated according to the monthly fee, number of calls, usage traffic, and number of short messages sent by newly networked users, and if the number of similar users in the same user set (including the selected user) is greater than or equal to 5, all of these similar users are determined to be suspected card-maintenance users. A specific example computing flow may be as follows:
Step 1: Index selection and normalization (equivalent to step S4011). For example, the charge-out revenue, number of calls, usage traffic, and number of short messages of newly developed users are selected, and their mean values and standard deviations are calculated according to the service type. Then, the scores of the charge-out revenue, number of calls, usage traffic, and number of short messages are calculated respectively. For example, charge-out revenue score = (charge-out revenue - mean charge-out revenue) / standard deviation of charge-out revenue.
Step 2: user similarities between the selected user and other users in the same user set (e.g., users in the same channel or the same website or other scope) are calculated (equivalent to steps S4012 to S4013). For example, for user a, the overall similarity between him and user b may be calculated as follows:
in this example, the higher the calculated overall similarity is, it indicates that the common communication behaviors such as income, call, traffic, short message, etc. between users are very close, and it is possible for the card-maintained user to simulate the communication behavior in batch by using an algorithm executed on hardware.
Step 3: Users with similar behaviors (i.e., associated user identifications) are determined (equivalent to steps S4021 to S4023). For example, two users are determined to be similar if the similarity between them is greater than or equal to a certain threshold (e.g., 0.97, in which case the normalized distance between them is less than or equal to 0.03). Then, if the number of users in the same user set that are similar to the selected user, plus the selected user (i.e., all similar users for the behavior feature data/score), is greater than or equal to a number threshold (e.g., 5), the selected user and the users with similar behavior features are determined to be suspected card-maintenance users.
The algorithm of similarity and the associated method of determining (filtering) out the third set of users may vary depending on the selected behavior feature data or combination of behavior feature data, and the invention is not limited thereto.
As shown in fig. 4C, steps S401 and S402 may be implemented with the example method 400''. The method 400'' includes the steps of:
Steps S4014 to S4015 may implement step S401 as shown in fig. 4A.
Step S4014: the IMEI of the user indicated by each user identification in the second set of users is extracted.
In this example, the behavior characteristic data of the user indicated by the user identification is the IMEI, and the behavior characteristic data indexed by the user identification can be searched (e.g., on a server of the operator) to extract the corresponding IMEI. Since the IMEI is identification-type data, the similarity comparison between users can be performed digit by digit on the IMEI; in one example, the comparison only determines whether the IMEIs of two users are the same or different.
Step S4015: For a selected user identification in the second set of users, the user identifications of other users having the same IMEI as the user indicated by the selected user identification are taken as the associated user identifications.
Here, multiple users with the same IMEI indicate that these users are active and communicating on the same hardware. When the number of users with the same IMEI (here, the selected user identification plus the other user identifications with the same IMEI) is greater than or equal to a certain threshold (e.g., 5), this indicates that the "same hardware" is likely to be a "cat pool" (modem pool) rather than a dual-SIM dual-standby handset commonly used by general users. This situation may be referred to as "IMEI concentration".
Steps S4021 to S4023 may implement step S402 as shown in fig. 4A, which is the same as fig. 4B, and thus a description thereof is omitted.
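The IMEI-based variant can be sketched by grouping user identifications by IMEI and keeping the groups that reach the number threshold. Because the comparison is simple equality, grouping by IMEI yields the same result as selecting each user in turn and counting its associated users. The function name and input layout (user identification mapped to IMEI string) are assumptions for illustration.

```python
from collections import defaultdict


def imei_concentrated_users(imei_by_user, number_threshold=5):
    """Return the user IDs whose IMEI is shared by at least `number_threshold` users."""
    users_by_imei = defaultdict(set)
    for user_id, imei in imei_by_user.items():
        users_by_imei[imei].add(user_id)

    suspected = set()
    for users in users_by_imei.values():
        # len(users) equals the associated users plus the selected user.
        if len(users) >= number_threshold:
            suspected |= users
    return suspected
```

A single pass over the users is enough here, which avoids the pairwise comparison needed for score-based similarity.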
By determining whether the number of users exhibiting a particular abnormal behavior feature (e.g., IMEI concentration) reaches a threshold, a set of users suspected to be abnormal (e.g., card-maintenance users) is further determined. Since the methods of figs. 4A-4C can be performed after the method of fig. 3 (and, optionally, the methods of figs. 2A-2B), as shown in fig. 1B, hierarchical and flexible identification of abnormal users (e.g., card-maintenance users) can be realized: screening proceeds level by level in the order of "preconditions", "requisite conditions", and "additional conditions" (where the "preconditions" may be optional), which helps reduce the amount of data processed, increase the processing speed, and improve the accuracy of identification.
Fig. 5 is a diagram illustrating behavior similarity according to an embodiment of the present invention, wherein the behavior similarity is described in terms of traffic and call times, and is measured in terms of distance.
As shown in fig. 5, the number of calls indicated by the horizontal axis increases in the positive direction, and the usage traffic indicated by the vertical axis increases in the positive direction. Each scattered point in the graph indicates one user, and the position of the point indicates both the usage traffic and the number of calls of that user. Accordingly, the closer two points are, the higher the similarity between the users they indicate. If the 7 users circled in the graph are users with high similarity (for example, the similarity distance between each other is smaller than a certain threshold), and the number of users in the area (i.e., within the behavior feature similarity threshold range) is larger than a certain threshold (for example, 5), the 7 users can be determined as suspected abnormal users (for example, card-maintenance users).
Alternatively, the similarity may be indicated by a three-dimensional space or an even more-dimensional space (when 3 or more behavior feature data are used to indicate the similarity of behavior features), which is not limited in this respect.
FIG. 6 is a block diagram illustrating an example apparatus 600 for anomalous user (e.g., card maintenance user) identification in accordance with an embodiment of the present invention.
As shown in fig. 6, the example apparatus 600 may include a first filtering unit 601, a second filtering unit 602, and a third filtering unit 603, which are configured to perform three rounds of filtering on input data to achieve hierarchical filtering, thereby improving recognition efficiency and accuracy and reducing the amount of processed data.
Specifically, the first filtering unit 601 may perform step S101 shown in fig. 1B or steps S201 to S203 shown in figs. 2A-2B with the operation data as input to generate the first set of users for the second filtering unit 602. The second filtering unit 602 may perform steps S102 to S103 shown in fig. 1B or steps S1031 to S1035 shown in fig. 3 with the first set of users as input to generate a second set of users for the third filtering unit 603. The third filtering unit 603 may implement steps S104 to S105 shown in fig. 1B or steps S401 to S402 shown in figs. 4A to 4C with the second set of users as input to generate a third set of users as the set of finally identified abnormal users (e.g., card-maintenance users).
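The chaining of the three filtering units can be pictured with the following simplified Python sketch. It treats each unit as a function from a set of user identifications to a narrower set, which is a simplification made here (in the description, the first unit takes operation data as input); the names `Filter` and `run_pipeline` are hypothetical.

```python
from typing import Callable, Iterable, List, Set

# A filtering unit modeled as a function from a user set to a subset of it.
Filter = Callable[[Set[str]], Set[str]]


def run_pipeline(user_ids: Iterable[str], stages: List[Filter]) -> List[Set[str]]:
    """Apply the filtering stages in order and return the set after each stage."""
    current = set(user_ids)
    results = []
    for stage in stages:
        # Intersect so that a stage can only narrow the set it receives.
        current = stage(current) & current
        results.append(current)
    return results
```

With three stages, the returned list holds the first, second, and third sets of users in order, and each later stage only processes what the earlier stage let through.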
Further, in accordance with embodiments of the present invention, the various methods and apparatus described above may be implemented using neural networks, with machine learning and training used to optimize parameters. The invention does not limit the type of neural network; for example, a recurrent neural network may be used. For example, various threshold parameters or combinations of parameters of the neural network may be trained based on minimizing one or more of: the difference between the actual number of abnormal users included in the first set of users and the total number of users of the first set of users, the difference between the actual number of abnormal users included in the second set of users and the total number of users of the second set of users, and the difference between the actual number of abnormal users included in the third set of users and the total number of users of the third set of users. That is, the neural network may be trained so that the set obtained in each round of filtering contains a larger number of actual abnormal users (i.e., increasing the efficiency of each round of filtering). Alternatively, training may also be based on the screening speed (the time difference between input and output of results). Also, the various units of the example apparatus may each be implemented with a separate neural network, or the apparatus as a whole may be implemented as a single neural network.
Fig. 7 is a block diagram illustrating a smart device 700 according to an embodiment of the present invention.
As shown in fig. 7, the smart device 700 may include a storage 701 and a processor 702. The storage 701 is used to store computer programs. The processor 702 runs the stored computer program for implementing the various methods as described above.
The storage device 701 may include a volatile memory, such as a random-access memory (RAM); the storage device may also include a non-volatile memory, such as a flash memory or a solid-state drive (SSD); the storage device may also include a combination of the above kinds of memory.
The processor 702 may be a central processing unit (CPU). The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a generic array logic (GAL), or the like.
Optionally, the storage device is further used to store program instructions. The processor may invoke the program instructions to implement the methods shown in the embodiments of figs. 1B-4C of the present application.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method provided by the foregoing embodiment.
The steps in the methods of the embodiments of the application can be reordered, combined, and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.