Detailed Description
The abnormal behavior described above (e.g., "card-keeping" behavior) is profitable only if its cost is kept under control at a certain scale. Through long-term data analysis and crackdowns on card keeping, the behavioral characteristics of channel card keeping in the local market have gradually become clear; they mainly appear as month-end sales pushes by the channel, concentrated user development, low liveness, and the like. Accordingly, in the embodiments of the invention, the conditions for identifying abnormal users (such as card-keeping users) are organized along three major dimensions (concentrated development, communication behavior, and low user quality) and subdivided into three levels (preconditions, requisite conditions, and additional conditions). This enables hierarchical and flexible abnormal user identification, so that abnormal users with various behavioral characteristics can be identified effectively, accurately, and thoroughly.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Fig. 1A is a schematic diagram of a cloud service system according to an embodiment of the present invention. The system comprises a cloud server 101, a user terminal 102, an operator server 103, and a service network point device 104. It should be understood that the user terminal 102, the operator server 103, and the service network point device 104 in the figure are only schematic; a practical application scenario includes a large number of user terminals 102, operator servers 103, and service network point devices 104. After the cloud server 101 establishes connections with the user terminal 102, the operator server 103, and the service network point device 104 through various networks (such as a computer network and a mobile communication network), it can, with the consent of the relevant user, collect the corresponding user information for executing the method for identifying abnormal users in mobile communication according to the present application. For example, a prompt interface may list the items of information to be collected, and the information is collected only after the corresponding user agrees on that interface.
The cloud server 101 can provide cloud services. A cloud service implements the addition, use, and interaction of related services over a network and can provide dynamic, easily scalable, virtualized resources through the Internet. The cloud computing environment carried by the cloud server 101 can include components such as a cloud management function, a software as a service (SaaS) layer, a platform as a service (PaaS) layer, and an infrastructure as a service (IaaS) layer. These components work cooperatively so that the cloud computing environment can provide various services via a cloud or networking environment, for example, the results of a method of anomalous user identification (i.e., an identified anomalous user set) in accordance with an embodiment of the invention.
The cloud management function may provide integrated management of various cloud services, including, but not limited to, the SaaS layer, the PaaS layer, and the IaaS layer. For example, the cloud management function may include provisioning, managing, and tracking the various cloud services subscribed to by a user.
The SaaS layer may provide software-level cloud services, for example, by directly interacting with a user through a user interface. For example, the SaaS layer may provide the ability to build and deliver a set of on-demand applications on an integrated development and deployment platform. The cloud services provided by the SaaS layer facilitate a user to obtain an identified set of abnormal users using an application executing on the cloud computing environment, for example, using the methods illustrated in fig. 1B-4C and/or the apparatus illustrated in fig. 6.
The PaaS layer can, for example, provide a platform-level solution to develop and distribute applications. For example, the PaaS layer may provide a distributed operating system for users in need of development to enable collaborative development of users and elastic expansion of resources. The services provided by the PaaS layer facilitate user utilization of programming languages and tools supported by the cloud computing environment and control of deployed services to enable, for example, improvement or training of various features of the methods shown in fig. 1B-4C and/or the apparatus shown in fig. 6.
The IaaS layer may, for example, provide various underlying hardware resources at the infrastructure level. For example, the IaaS layer may provide servers, storage, and other network hardware to reduce the user's hardware maintenance costs and ease office space constraints. The IaaS layer facilitates management and control of hardware resources by users and can be used, for example, to implement the smart device described herein on hardware.
According to an embodiment of the present invention, the method shown in fig. 1B-4C and/or the apparatus shown in fig. 6 and/or the smart device shown in fig. 7 may be implemented on one or more of the SaaS layer, the PaaS layer, and the IaaS layer described above, alone or in combination, and may be implemented in hardware, software, or a combination of the two. In other words, methods, apparatus, devices, and/or computer-readable storage media according to aspects of embodiments of the present invention may be implemented in a cloud computing environment; for example, one skilled in the art may implement the various features at the respective layers, individually or in combination, as desired. For example, a user may subscribe to one or more services provided by a cloud computing environment. The cloud computing environment may then perform processing to provide the one or more services to which the user is subscribed, such as providing the set of abnormal users identified by implementing the methods illustrated in fig. 1B-4C and/or the apparatus illustrated in fig. 6, and/or related development functions.
FIG. 1B is a flowchart illustrating an example method 100 for anomalous user identification in accordance with an embodiment of the invention.
According to embodiments of the present invention, abnormal users typically include users with abnormal communication behavior, such as card-keeping users and low-quality customers. The embodiments below use a card-keeping user as an example of an abnormal user, but the present invention is not limited thereto. A card-keeping user generally refers to the user corresponding to a SIM card used for card keeping, and each card-keeping user corresponds to one user identification, such as the identification of the SIM card. Typically, the data associated with card keeping includes, but is not limited to, the user's call duration, traffic size, number of short messages, billing revenue, etc. Such data is routinely recorded by the operator for daily operation and may be obtained directly from operator-side operational data (e.g., from a server set up by the operator) without intruding on the privacy of individual users (as would occur, e.g., if a user's personal data were analyzed manually).
As shown in fig. 1B, an example method 100 for anomalous user identification can be performed by a smart device, which can be, for example, a dedicated server set by an operator in the background. The method comprises the following steps:
Step S101: a first set of users is obtained.
The first set of users may include a plurality of user identifications (e.g., identifications of SIM cards). Each user identification may be mapped to the corresponding user's operational data in the data from the operator, for example, for analysis related to abnormal user identification. The first set of users may be any set of user identifications to be screened that may contain the user identifications of abnormal users (e.g., card-keeping users), i.e., a set of suspected abnormal users. For example, the first set of users may be obtained from an operator or other data source either according to some predetermined condition or arbitrarily; the invention is not limited in this regard.
Step S102: liveness association data is obtained for the user indicated by each user identification in the first set of users.
As with the operational data described above, each user identification may be mapped to the liveness association data of the corresponding user (e.g., stored in a server of the operator), so the liveness association data for each user in the first set of users may be obtained from it. Typically, the liveness association data includes the indicated user's call duration (e.g., in minutes), traffic size (e.g., in MB), number of short messages, and any other data that reflects how frequently the user communicates (e.g., using the corresponding SIM card) within a limited time.
Step S103: a second set of users is determined from the first set of users based on the liveness association data.
The frequency of the user's communication actions (such as short messages and calls) can be determined from the liveness association data, and whether the user is likely to be an abnormal user (e.g., a card-keeping user) can be judged from that frequency. The higher the frequency of use (i.e., the higher the liveness), the more likely the user is a normal user; the lower the frequency of use (i.e., the lower the liveness), the more likely the user is an abnormal user. To facilitate this determination, a respective liveness threshold may be set for each item of liveness association data, or a single total liveness threshold may be set for several items together, as described in detail below. Thus, based on the liveness association data and the corresponding thresholds, the liveness of the user indicated by each user identification in the first set of users can be measured, and the user identifications of the one or more users that satisfy the liveness threshold condition (e.g., fall below a certain liveness threshold) are screened out of the first set of users to form the second set of users (i.e., a narrowed set of suspected abnormal users).
According to an embodiment of the present invention, the determination (or screening) based on the liveness association data may serve as a requisite condition for identifying an abnormal user (e.g., a card-keeping user). This is because abnormal behavior such as card keeping generally keeps liveness low to save cost; once liveness is sufficiently high, the behavior is no longer profitable.
Step S104: behavioral characteristic data is obtained for the user indicated by each user identification in the second set of users.
Similarly, each user identification may be mapped to the behavioral characteristic data of the corresponding user (e.g., stored in a server of the operator), so the behavioral characteristic data for each user in the second set of users may be obtained from it. In general, in addition to the call duration, traffic size, and number of short messages described above, the behavioral characteristic data includes the indicated user's billing revenue (e.g., in yuan), contact numbers, user identity (e.g., the identity document number used for registration, the IMEI, etc.), access base stations, activation time, off-network time, and any other data capable of characterizing the user's communication behavior; the invention is not limited in this regard. The behavioral characteristic data and the liveness association data may evidently overlap, for example in the call duration, traffic size, and number of short messages.
Step S105: a third set of users is determined from the second set of users based on the behavioral characteristic data.
Communication behavior characteristics (e.g., classified communication behavior characteristics) of the user indicated by the user identification may be analyzed from the behavioral characteristic data, and whether the user is likely to be an abnormal user (e.g., a card-keeping user) is determined on that basis. The classifications of behavioral characteristics may include contact number concentration, short power-on time, account-opening identity concentration, access base station concentration, IMEI concentration, short-term off-network, and any other behavioral characteristic indicating that one or more users may be abnormal users; the invention is not limited in this regard.
For example, the behavioral characteristic "IMEI concentration" may refer to a number of users greater than or equal to a certain threshold (e.g., 5) activating cards or placing calls using terminals with the same IMEI. The behavioral characteristic "short-term off-network" may refer to a newly developed user leaving the network within a certain time threshold (e.g., three months, which may be counted inclusive of the month of network entry), or generating no bill after that time threshold.
After the behavioral characteristic data is analyzed, the user identification of a single user satisfying a certain classification (e.g., short-term off-network) may be recorded in the third set of users, or the user identifications of a plurality of users satisfying a certain classification (e.g., IMEI concentration) may be recorded in the third set of users. For example, if the behavioral characteristic data shows that a user left the network within three months or generated no bill after three months, that user's identification may be recorded in the third set of users. As another example, the user identifications of a plurality of users satisfying a particular classification (e.g., IMEI concentration) may be recorded in the third set of users only if the number of such users is greater than a certain threshold (e.g., 5). If the behavioral characteristic data shows that a plurality of users (especially more than a certain threshold) activate cards or place calls using terminals with the same IMEI, a "cat pool" (modem pool) is likely being used for card keeping. Further behavioral characteristic data and related analysis are described below.
Therefore, based on the behavioral characteristic data and the corresponding classifications, the behavior of one or more user identifications in the second set of users can be classified, and a single user identification satisfying a certain classification condition, or a plurality of user identifications (more than a certain threshold) satisfying a certain classification condition, is selected from the second set of users. The selected user identifications are recorded in the third set of users, which serves as a further narrowed set of suspected abnormal users (e.g., card-keeping users).
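As a minimal sketch of the group-based classification described above, the following Python snippet groups user identifications by IMEI and flags every group that reaches a threshold size. The function name `imei_concentrated_users`, the shape of the input mapping, and the choice of "greater than or equal to" for the comparison are assumptions made for illustration and are not part of the described embodiment.

```python
from collections import defaultdict

def imei_concentrated_users(imei_by_user, min_users=5):
    """Return the user IDs whose terminal IMEI is shared by at least `min_users` users.

    `imei_by_user` is a hypothetical mapping {user_id: imei}.
    """
    users_by_imei = defaultdict(set)
    for user_id, imei in imei_by_user.items():
        users_by_imei[imei].add(user_id)

    flagged = set()
    for users in users_by_imei.values():
        # A group counts as "concentrated" only when it reaches the threshold.
        if len(users) >= min_users:
            flagged |= users
    return flagged

# Illustrative data: five users share one IMEI, one user has its own terminal.
records = {f"u{i}": "IMEI-A" for i in range(1, 6)}
records["u6"] = "IMEI-B"
print(sorted(imei_concentrated_users(records)))  # ['u1', 'u2', 'u3', 'u4', 'u5']
```

The same grouping pattern would apply to any other characteristic that is judged over a group of users rather than a single user.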
In accordance with embodiments of the present invention, the determination (or screening) based on behavioral characteristic data may serve as an additional condition for identifying an abnormal user (e.g., a card-keeping user): once the requisite condition is determined to be met, the larger resulting set of users is screened further in order to locate one or more abnormal users more accurately. A user who exhibits both low liveness and a specific abnormal behavioral characteristic, or a group of users who exhibit both low liveness and a specific concentrated abnormal behavioral characteristic, is more likely to be abnormal. Such two-level determination (liveness analysis followed by behavioral characteristic analysis) therefore identifies abnormal users more effectively than multiple screenings applied in no particular order, and it reduces the amount of data to be processed.
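The two-level screening of steps S102 to S105 can be sketched as a simple funnel. The snippet below is only an illustration under stated assumptions: the function name `screen_users`, the precomputed per-user liveness value, and the per-user list of matched behavior classifications are all hypothetical inputs.

```python
def screen_users(first_set, liveness, behavior_flags, liveness_threshold):
    """Two-level screening: low liveness first, then abnormal behavior.

    liveness:       hypothetical mapping {user_id: liveness value}
    behavior_flags: hypothetical mapping {user_id: [matched classifications]}
    """
    # Steps S102-S103: keep only users whose liveness is at or below the threshold.
    second_set = [u for u in first_set if liveness[u] <= liveness_threshold]
    # Steps S104-S105: among those, keep users matching at least one classification.
    third_set = [u for u in second_set if behavior_flags.get(u)]
    return second_set, third_set

first_set = ["u1", "u2", "u3", "u4"]
liveness = {"u1": 2.7, "u2": 0.1, "u3": 0.0, "u4": 0.2}
behavior_flags = {"u2": ["IMEI concentration"], "u4": []}

second_set, third_set = screen_users(first_set, liveness, behavior_flags, 0.5)
print(second_set)  # ['u2', 'u3', 'u4']
print(third_set)   # ['u2']
```

Because the behavior analysis runs only on the second set, the more expensive classification work is done on fewer users, which is the data reduction the paragraph above refers to.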
Fig. 2A-2B are flowcharts illustrating example methods for determining a first set of users in accordance with an embodiment of the present invention. As described above, the first set of users may generally be obtained based on the number of newly added users at a plurality of target sites.
As shown in fig. 2A, an example method 200 for determining a first set of users may implement step S101 (i.e., obtain the first set of users) as shown in fig. 1B, and may include the steps of:
Step S201: operational data for a plurality of target sites is obtained from an operator database. The operational data may include the daily number of newly added users for each of the plurality of target sites over a predetermined period of time.
For example, the predetermined period of time may typically be a month or a specified number of days.
Step S202: for each target site, it is determined whether the target site satisfies a predetermined condition based on the number of newly added users per day.
The predetermined condition here may be the predetermined condition mentioned in step S101. For example, it may be used to single out one of a plurality of sites within a channel (e.g., a newly developed channel, or another channel that needs to be checked for abnormal users such as card-keeping users) as a set of suspected abnormal users. Specific examples of the predetermined condition are described below.
Step S203: if the target site is determined to satisfy the predetermined condition, the plurality of user identifications corresponding to the target site are determined as the first set of users. Otherwise, the process may return to step S202 and perform step S202 on the next target site.
As shown in fig. 2B, an example method 200' for determining a first set of users may include the steps of:
Step S201 and step S203 are the same as those of fig. 2A, and their description is omitted here.
Steps S2021 to S2024 may implement step S202 shown in fig. 2A, and specifically include:
Step S2021: based on the daily number of newly added users of the target site in the predetermined period of time, a first daily average number of newly added users in a first sub-period of the predetermined period and a second daily average number of newly added users in a second sub-period of the predetermined period are calculated, and the ratio of the first daily average to the second daily average is compared with a first threshold.
For example, the predetermined period of time may be a specified month (e.g., any of months 1-12 of a specified year), the first sub-period may be the last few days (e.g., the last five days) of the specified month, and the second sub-period may be the remaining days of the specified month. In one example, the first sub-period is the last five days of a 31-day month, the second sub-period is the first twenty-six days of that month, and the daily number of newly added users is denoted $X_i$, where $X$ denotes the number of newly added users and the subscript $i$ denotes the day of the month; for instance, the number of newly added users on the sixth day of the month is $X_6$. In this example, the daily average number of newly added users over the last five days (i.e., the first daily average number of newly added users) is calculated as $\bar{X}_{27\sim31} = \frac{1}{5}\sum_{i=27}^{31} X_i$, and the daily average over the first twenty-six days (i.e., the second daily average number of newly added users) is calculated as $\bar{X}_{1\sim26} = \frac{1}{26}\sum_{i=1}^{26} X_i$. The ratio of the first daily average to the second daily average is then calculated as $\bar{X}_{27\sim31} / \bar{X}_{1\sim26}$. This ratio indicates whether the site's daily average user development at month end exceeds that of the rest of the month by a "suspicious" margin, because the development of certain card-keeping users tends to occur at month end, when a month-end "push" is made for better performance figures. For example, the first threshold may be set to 1.4, i.e., a month-end level at least 40% above that of the rest of the month. Alternatively, the ratio may be replaced with the difference between the first daily average and the second daily average; the present invention is not limited in this regard.
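The month-end ratio of step S2021 can be illustrated with a short Python sketch. The function name `month_end_ratio`, the list-based input, and the handling of a zero average in the earlier days are assumptions made for this example only.

```python
def month_end_ratio(daily_new_users, tail_days=5):
    """Ratio of the month-end daily average to the daily average of the earlier days.

    `daily_new_users[i]` is the number of newly added users on day i + 1.
    """
    head = daily_new_users[:-tail_days]   # second sub-period (earlier days)
    tail = daily_new_users[-tail_days:]   # first sub-period (last `tail_days` days)
    head_avg = sum(head) / len(head)
    tail_avg = sum(tail) / len(tail)
    if head_avg == 0:
        # Assumption: no development earlier in the month makes any month-end
        # development infinitely "concentrated".
        return float("inf") if tail_avg > 0 else 0.0
    return tail_avg / head_avg

# A 31-day month: 10 new users per day for 26 days, then 15 per day for 5 days.
daily = [10] * 26 + [15] * 5
ratio = month_end_ratio(daily)
print(ratio)         # 1.5
print(ratio >= 1.4)  # True: the example first threshold of 1.4 is reached
```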
Step S2022: the accumulated number of newly added users of the target site in a third sub-period of the predetermined period is calculated and compared with a second threshold.
For example, the third sub-period may be any three consecutive days within the predetermined period. In one example, taking the third sub-period as the twentieth to twenty-second days of a month, the accumulated number of newly added users in the third sub-period is calculated as $N_{i=20\sim22} = \sum_{i=20}^{22} X_i$, where $N$ denotes the accumulated number of newly added users and the subscript $i = 20\sim22$ denotes the consecutive days from the twentieth to the twenty-second of the month. This value indicates whether the number of users developed in a concentrated manner within the three consecutive days (i.e., the accumulated number of newly added users) significantly exceeds the normal level. For example, the second threshold may be set to 50, 100, or another number; the present invention is not limited in this regard.
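The accumulated count of step S2022 can be sketched as follows. The first function sums a fixed window such as days 20 to 22; the second slides a window over the whole period and returns the largest total, which is one possible way (an assumption, not stated in the description) to cover the case where the three consecutive days are chosen arbitrarily. Both function names are hypothetical.

```python
def accumulated_new_users(daily_new_users, first_day, last_day):
    """Sum of newly added users from `first_day` to `last_day` (1-based, inclusive)."""
    return sum(daily_new_users[first_day - 1:last_day])

def max_consecutive_total(daily_new_users, window=3):
    """Largest total over any `window` consecutive days, using a sliding window."""
    if len(daily_new_users) < window:
        return sum(daily_new_users)
    current = sum(daily_new_users[:window])
    best = current
    for i in range(window, len(daily_new_users)):
        # Slide by one day: add the new day, drop the day leaving the window.
        current += daily_new_users[i] - daily_new_users[i - window]
        best = max(best, current)
    return best

daily = [2, 3, 1, 20, 25, 15, 4]
print(accumulated_new_users(daily, 4, 6))  # 60
print(max_consecutive_total(daily))        # 60
```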
As shown, steps S2021 and S2022 may be performed in parallel, and accordingly the subsequent steps S2023 and S2024 may also be performed in parallel. These steps are as follows:
Step S2023: it is determined whether the calculated ratio is greater than or equal to the first threshold.
Step S2024: it is determined whether the accumulated number of newly added users is greater than or equal to the second threshold.
As shown, steps S2023 and S2024 converge on a single arrow pointing to step S203, which means that the two determinations are in a logical OR relationship. That is, the target site is determined to satisfy the predetermined condition if at least one of the following holds: the ratio is greater than or equal to the first threshold (step S2023), or the accumulated number of newly added users is greater than or equal to the second threshold (step S2024). In that case, the target site is regarded as a site suspected of having abnormal users (e.g., card-keeping users) and worth further screening, and its users are determined as the first set of users.
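The logical OR of the two determinations can be sketched in Python as below. The function names, the default thresholds (1.4 and 50, taken from the examples above), the fixed window of days 20 to 22, and the sample site data are assumptions for illustration.

```python
def site_meets_condition(daily, first_threshold=1.4, second_threshold=50,
                         tail_days=5, window=(20, 22)):
    """Return True if a site satisfies the predetermined condition (S2023 OR S2024)."""
    head, tail = daily[:-tail_days], daily[-tail_days:]
    head_avg = sum(head) / len(head)
    tail_avg = sum(tail) / len(tail)
    # Step S2023: ratio >= first threshold, written without a division so that
    # a zero average in the earlier days cannot raise ZeroDivisionError.
    ratio_hit = tail_avg > 0 and tail_avg >= first_threshold * head_avg
    # Step S2024: accumulated new users in the window >= second threshold.
    first_day, last_day = window
    count_hit = sum(daily[first_day - 1:last_day]) >= second_threshold
    return ratio_hit or count_hit

def suspicious_sites(daily_by_site):
    """Names of the sites whose users would be taken as a first set of users."""
    return [site for site, daily in daily_by_site.items() if site_meets_condition(daily)]

daily_by_site = {
    "steady": [3] * 31,                          # neither condition holds
    "month_end": [2] * 26 + [4] * 5,             # month-end ratio is 2.0
    "burst": [1] * 19 + [20, 20, 20] + [1] * 9,  # 60 new users on days 20-22
}
print(suspicious_sites(daily_by_site))  # ['month_end', 'burst']
```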
Here, the methods 200 and 200' may be used to select a site from a plurality of sites as the set of users to be screened. Further, the methods 200 and 200' may serve as the "preconditions" which, combined with the "requisite conditions" and "additional conditions" described above, enable hierarchical and flexible identification of abnormal users (e.g., card-keeping users) and help improve the efficiency of identification.
Fig. 3 is a flowchart illustrating an example method 300 for determining a second set of users, according to an embodiment of the invention.
As shown in fig. 3, an example method 300 for determining a second set of users may implement step S103 shown in fig. 1B (i.e., determining the second set of users from the first set of users based on liveness association data). In an embodiment according to the present invention, the liveness association data includes at least the call duration, traffic size, and number of short messages of the user indicated by the user identification, and the method 300 may include the following steps after step S102 described above:
S1031: for each user identification in the first set of users, the weighted sum of the call duration, traffic size, and number of short messages of the indicated user is calculated.
For example, the call duration, traffic size, and number of short messages of the user indicated by the selected user identification may be data for a certain period of time (e.g., a certain month). Because the units of these indicators generally differ, the indicators may be normalized before the weighted sum is calculated. The weight of each indicator may default to 1 or may be set differently as needed. In this example, the weighted sum indicates how active the user was during the month: the higher the weighted sum, the higher the liveness, and the lower the weighted sum, the lower the liveness.
S1032: the calculated weighted sum is compared to a third threshold, for example, to determine whether the weighted sum is less than or equal to the third threshold.
The third threshold may be any weighted-sum threshold that reflects the level of liveness; the name is used here only as an example, and the invention is not limited in this regard.
S1033: if the calculated weighted sum is less than or equal to the third threshold, the user identification is recorded in the second set of users. Otherwise, the process may return to step S1031 and perform steps S1031 to S1032 for the next user identification.
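Steps S1031 to S1033 can be sketched as follows. The description does not specify a normalization method, so this example assumes min-max normalization over the first set of users; the function names, the tuple layout (call minutes, traffic in MB, short message count), and the threshold value of 0.5 are likewise hypothetical.

```python
def normalize(values):
    """Min-max normalize a list of numbers to the range [0, 1] (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def liveness_scores(usage, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of normalized indicators for each user (step S1031)."""
    ids = list(usage)
    columns = list(zip(*(usage[u] for u in ids)))   # one column per indicator
    normed = [normalize(list(col)) for col in columns]
    return {
        u: sum(w * normed[k][i] for k, w in enumerate(weights))
        for i, u in enumerate(ids)
    }

def second_user_set(usage, third_threshold, weights=(1.0, 1.0, 1.0)):
    """Users whose weighted sum is at or below the threshold (steps S1032-S1033)."""
    scores = liveness_scores(usage, weights)
    return {u for u, s in scores.items() if s <= third_threshold}

usage = {
    "u1": (300, 2000, 40),
    "u2": (4, 2, 1),
    "u3": (0, 0, 0),
    "u4": (150, 1000, 20),
}
print(sorted(second_user_set(usage, third_threshold=0.5)))  # ['u2', 'u3']
```

With all weights left at 1, each indicator contributes equally; raising one weight makes the corresponding indicator count more toward the liveness value.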
Optionally, the method 300 may further include the following steps after step S1033 as described above:
S1034: the call duration, traffic size, and number of short messages of the user indicated by each user identification in the second set of users are compared with a low call duration threshold, a low traffic size threshold, and a low short message number threshold, respectively, for example, to determine whether each value is less than or equal to its corresponding threshold. If so, in step S1035, the user identification whose call duration is less than or equal to the low call duration threshold, whose traffic size is less than or equal to the low traffic size threshold, and whose number of short messages is less than or equal to the low short message number threshold is taken as a target user identification and deleted from the second set of users. If not, step S1034 is repeated for the next user identification in the second set of users.
In this case, the indicators need not be normalized. For example, the low call duration threshold may be set to 5 minutes, the low traffic size threshold to 3 MB, and the low short message number threshold to 4. When the call duration, traffic size, and number of short messages are each less than or equal to the corresponding threshold, the user can be determined to be a "very low use" user. When the call duration, traffic size, and number of short messages are all zero, the user can be determined to be a "three-none" user (i.e., no calls, no traffic, and no short messages).
S1035: as previously described, the corresponding user identity that the call duration, the traffic size, and the number of short messages simultaneously satisfy the threshold condition (e.g., less than or equal to the respective threshold) is determined as the target user identity, and is deleted from the second user set. That is, each target user identity in the second set of users is deleted.
Through steps S1034 to S1035, the user identifications corresponding to users with markedly low liveness (i.e., markedly low-quality users) can be filtered out of the second set of users. Whether or not such low-quality users are card-keeping users, it may not be worthwhile for the operator to provide services to them. For example, it may then be analyzed whether such a user's low-quality communication behavior ("three-none" or "very low use") has lasted longer than a certain time (e.g., three months) to further determine whether service to the user should be stopped.
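A minimal sketch of the filtering in steps S1034 to S1035 is given below, using the example thresholds of 5 minutes, 3 MB, and 4 short messages. The function names, the label strings, and the tuple layout of the usage data are assumptions made for illustration.

```python
# Example thresholds from the description; units are minutes, MB, and messages.
LOW_CALL_MINUTES = 5
LOW_TRAFFIC_MB = 3
LOW_SMS_COUNT = 4

def low_use_label(minutes, traffic_mb, sms_count):
    """Return "three-none", "very low use", or None for one user's usage."""
    if minutes == 0 and traffic_mb == 0 and sms_count == 0:
        return "three-none"
    if (minutes <= LOW_CALL_MINUTES and traffic_mb <= LOW_TRAFFIC_MB
            and sms_count <= LOW_SMS_COUNT):
        return "very low use"
    return None

def remove_target_users(second_set, usage):
    """Split the second set into (remaining users, removed users with labels)."""
    remaining, removed = [], {}
    for user_id in second_set:
        label = low_use_label(*usage[user_id])
        if label is None:
            remaining.append(user_id)
        else:
            removed[user_id] = label   # target user identification (step S1035)
    return remaining, removed

usage = {"u1": (0, 0, 0), "u2": (4, 2.5, 1), "u3": (12, 40, 6)}
remaining, removed = remove_target_users(["u1", "u2", "u3"], usage)
print(remaining)  # ['u3']
print(removed)    # {'u1': 'three-none', 'u2': 'very low use'}
```

Keeping the removed users together with their labels, rather than discarding them, leaves room for the follow-up analysis of how long the low-quality behavior has lasted.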
In one embodiment, S1034 may be performed before S1031, so that the user identifications corresponding to "three-none" and "very low use" users are deleted before the weighted sum is calculated.
Fig. 4A-4C are flowcharts illustrating example methods for determining a third set of users according to embodiments of the invention.
In an embodiment according to the invention, the behavioral characteristic data may include one or more of the billing revenue, number of short messages, call duration, traffic size, contact numbers, and user identity of the user indicated by the user identification. In addition to IMEI concentration and short-term off-network described above, the behavioral classifications may also include contact number concentration, short power-on time, account-opening identity concentration, access base station concentration, and the like.
For example, contact number concentration may refer to a user whose number of contact numbers (calling numbers plus called numbers) within a specified time (e.g., a specified month) is less than or equal to a certain threshold (e.g., 3), where the contact numbers typically exclude customer service numbers such as 10000.
For example, short power-on time may refer to a user whose number of power-on days within a specified time (e.g., a specified month) is less than or equal to a certain threshold (e.g., 3 days), where a calendar day on which the power-on duration exceeds a certain threshold (e.g., 2 hours) is counted as a power-on day.
For example, account-opening identity concentration may refer to more than a certain threshold (e.g., 3) of users opening accounts with the same identity document within a specified time (e.g., a specified month).
For example, access base station concentration may refer to a user whose number of accessed base stations within a specified time (e.g., a specified month) is less than or equal to a certain threshold (e.g., 3).
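The four classifications just defined can be written as small predicates. The sketch below uses the example thresholds from the text; the function names, the input shapes, and the sample phone numbers and identity values are hypothetical.

```python
from collections import Counter

CUSTOMER_SERVICE_NUMBERS = {"10000"}  # excluded from the contact count

def contact_number_concentrated(contacts, max_contacts=3):
    """True if the user contacted at most `max_contacts` distinct numbers."""
    distinct = set(contacts) - CUSTOMER_SERVICE_NUMBERS
    return len(distinct) <= max_contacts

def short_power_on_time(hours_on_per_day, min_hours=2, max_days=3):
    """True if at most `max_days` days had more than `min_hours` of power-on time."""
    power_on_days = sum(1 for hours in hours_on_per_day if hours > min_hours)
    return power_on_days <= max_days

def access_base_station_concentrated(base_stations, max_stations=3):
    """True if the user accessed at most `max_stations` distinct base stations."""
    return len(set(base_stations)) <= max_stations

def identity_concentrated_users(identity_by_user, threshold=3):
    """User IDs whose account-opening identity is shared by more than `threshold` users."""
    counts = Counter(identity_by_user.values())
    return {u for u, ident in identity_by_user.items() if counts[ident] > threshold}

print(contact_number_concentrated(["10000", "13800000001", "13800000002"]))  # True
print(short_power_on_time([0, 3, 0.5, 2.5, 0]))                              # True
print(access_base_station_concentrated(["bs1", "bs2", "bs1", "bs3", "bs4"])) # False
identities = {"u1": "ID1", "u2": "ID1", "u3": "ID1", "u4": "ID1", "u5": "ID2"}
print(sorted(identity_concentrated_users(identities)))  # ['u1', 'u2', 'u3', 'u4']
```

The first three predicates judge a single user, while the last one judges a group of users, matching the distinction between single-user and multi-user classifications.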
As shown in fig. 4A, an example method 400 for determining a third set of users may implement step S105 shown in fig. 1B (i.e., determining a third set of users from the second set of users based on behavioral characteristic data). The method 400 may include the following steps:
Step S401: the behavioral characteristic data of the user indicated by each user identification in the second set of users is analyzed.
As previously described, the analysis herein may be to analyze the communication behavior characteristics of the user indicated by the user identification (e.g., categorize the communication behavior characteristics of the user) to determine one or more user identifications that satisfy a particular abnormal behavior characteristic (i.e., a particular categorized communication behavior characteristic) as a result of the analysis.
Step S402: a third set of users is determined from the second set of users according to the result of the analysis.
As described above, one or more user identifications determined in step S401 to exhibit a particular abnormal behavioral characteristic may be included in the third set of users as a set of suspected abnormal users (e.g., card-keeping users). In another example, a plurality of user identifications determined to exhibit a particular abnormal behavioral characteristic may be included in the third set of users only when their number is greater than or equal to a threshold (e.g., when the particular card-keeping behavioral characteristic manifests in a concentrated manner). Here, including a user identification in the third set of users means recording it in the third set of users.
In a more specific embodiment, steps S401 and S402 may be implemented differently.
As shown in fig. 4B, steps S401 and S402 may be implemented with an example method 400'. The method 400' includes the steps of:
Steps S4011 to S4013 may implement step S401 shown in fig. 4A.
Step S4011: a behavioral characteristic score is calculated for the user indicated by each user identification in the second set of users.
According to embodiments of the present invention, behavioral characteristic scores may be calculated from behavioral characteristic data for the evaluation, analysis, classification, or the like of behavioral characteristics. For example, the behavioral characteristic score S may be a normalized representation of the behavioral characteristic data x, such as $S = (x - \mu)/\sigma$, where μ is the average of the behavioral characteristic data (e.g., over a particular time period such as a month) and σ is the standard deviation of the behavioral characteristic data. However, the method for calculating the behavioral characteristic score is not limited in any way, and other methods for measuring the behavioral characteristic data are also possible.
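As a minimal sketch of such a normalized score, the following Python function subtracts the average and divides by the standard deviation, consistent with the account income example given later in this description. The use of the population standard deviation, the handling of a zero deviation, and the name `behavior_scores` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def behavior_scores(values):
    """Map {user_id: raw value} to {user_id: (value - mean) / std}.

    Uses the population standard deviation (an assumption; the description
    does not say which estimator is intended).
    """
    xs = list(values.values())
    mu = mean(xs)
    sigma = pstdev(xs)
    if sigma == 0:
        # All users have the same value, so no user deviates from the mean.
        return {user_id: 0.0 for user_id in values}
    return {user_id: (x - mu) / sigma for user_id, x in values.items()}
```

One such score would be computed per type of behavioral characteristic data (e.g., once for the number of calls and once for the number of short messages).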
Step S4012: the behavioral characteristic similarity between the users indicated by every two user identifications is derived from the calculated behavioral characteristic scores of the users indicated by the user identifications.
According to an embodiment of the present invention, the behavioral characteristic similarity is calculated from the behavioral characteristic scores of every two users and characterizes the degree of behavioral similarity between the two users. In one example, the behavioral characteristic similarity may be described directly in terms of a similarity distance (e.g., the Euclidean distance). In this case, the distance d between user a and user b can be expressed as: $d(a,b) = \sqrt{\sum_j \left(S_j(a) - S_j(b)\right)^2}$, where $S_j(a)$ and $S_j(b)$ are the j-th behavioral characteristic scores of user a and user b, and j is an index indicating the type of behavioral characteristic score. For example, j=1 indicates that the behavioral characteristic score is an account score and j=2 indicates that it is a short message score; the index may be set as needed. Further, in this case, the lower the calculated behavioral characteristic similarity A (i.e., the distance characterizing the similarity), the higher the similarity of the selected behavioral characteristics between the two users. In another example, the similarity of the selected behavioral characteristics between two users may also be represented by the difference between 1 and the normalized similarity distance value. In this case, the relationship is reflected more directly in the numerical value: the higher the calculated behavioral characteristic similarity A (i.e., 1 - distance value), the higher the similarity of the selected behavioral characteristics between the two users.
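The two ways of expressing similarity described above can be sketched as follows. The sketch assumes that each user's scores are held in a dictionary keyed by the index j, and that the distance is normalized by dividing by a caller-supplied `max_distance`; the description does not specify the normalization, so that parameter is purely an assumption for illustration.

```python
import math

def distance(scores_a, scores_b):
    """Euclidean distance between two users' scores.

    Both arguments map the score index j to a score and are assumed to
    share the same keys.
    """
    return math.sqrt(sum((scores_a[j] - scores_b[j]) ** 2 for j in scores_a))

def similarity(scores_a, scores_b, max_distance):
    """Similarity as 1 minus the normalized distance, clamped to [0, 1].

    `max_distance` is an assumed normalization constant (for instance the
    largest pairwise distance observed in the user set).
    """
    if max_distance <= 0:
        return 1.0
    return 1.0 - min(distance(scores_a, scores_b) / max_distance, 1.0)
```

With the first form a smaller value means more similar users, whereas with the second form a larger value means more similar users, so the threshold comparison must match the form chosen.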
Step S4013: for a selected user identification in the second user set, the user identifications of other users whose behavioral characteristic similarity with the user indicated by the selected user identification is within the behavioral characteristic similarity threshold are taken as the associated user identifications of the selected user identification.
According to an embodiment of the present invention, for a selected user identification, step S4013 is performed to calculate the behavioral characteristic similarities, for certain behavioral characteristic data or a certain score, between the selected user identification and the other user identifications within the same user set. For example, for the short message score of user identification a (e.g., represented by j=2), the short message scores of the other user identifications b to z in the second user set may be calculated, and the short message score similarities $A_2(a,b)$, $A_2(a,c)$, ..., $A_2(a,z)$ may be calculated one by one (e.g., as 1 - normalized similarity distance value). Then, if the behavioral characteristic similarity threshold of the short message score is set to, for example, 0.97, the user identifications corresponding to those similarities among $A_2(a,b)$ to $A_2(a,z)$ that are greater than or equal to 0.97 are the associated user identifications to be recorded. For example, if $A_2(a,c)$, $A_2(a,f)$, $A_2(a,g)$, $A_2(a,m)$, $A_2(a,n)$ and $A_2(a,r)$ are each greater than or equal to 0.97, then the user identifications c, f, g, m, n and r are recorded as associated user identifications. According to embodiments of the invention, multiple kinds of behavioral characteristic data/scores may also be selected to calculate an overall (e.g., weighted) similarity.
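The selection of associated user identifications in step S4013 can be sketched as a simple threshold filter. The function name `associated_ids` and the layout of `similarities` (a mapping from each other user identification to its precomputed similarity with the selected one, in the "1 - normalized distance" form) are assumptions made for illustration, and the similarity values in the usage note are invented for the example.

```python
def associated_ids(selected, similarities, threshold=0.97):
    """Return the IDs whose similarity to `selected` is >= `threshold`.

    `similarities` maps each other user ID to its similarity with
    `selected` (hypothetical precomputed values in [0, 1]).
    """
    return sorted(
        user_id
        for user_id, value in similarities.items()
        if user_id != selected and value >= threshold
    )
```

For instance, `associated_ids("a", {"b": 0.80, "c": 0.99, "f": 0.97})` keeps c and f and drops b, because only the first falls below the threshold of 0.97.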
Steps S4021 to S4023 may implement step S402 as shown in fig. 4A.
Step S4021: the number of associated user identities is calculated.
In the above example, for the short message score, the associated user identities c, f, g, m, n and r are counted, giving a total of 6.
Step S4022: the calculated number of associated user identities plus one is compared with a predetermined number threshold, for example, to determine whether this sum is greater than or equal to the number threshold.
Step S4023: if it is determined that the calculated number of associated user identities plus one is greater than or equal to the predetermined number threshold, the selected user identity and its associated user identities are recorded together in the third user set.
In the case where the predetermined number threshold is 5, the sum 6+1=7 obtained above is compared with 5. Since 7 ≥ 5, the selected user identity a and its associated user identities c, f, g, m, n and r are recorded into the third user set.
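Steps S4021 to S4023 can be sketched as a count followed by a threshold comparison. The function name `record_if_concentrated` and the use of a Python set for the third user set are assumptions made for illustration only.

```python
def record_if_concentrated(selected, associated, third_set, number_threshold=5):
    """Record the selected ID and its associated IDs when there are enough of them.

    The IDs are added to `third_set` only if
    len(associated) + 1 >= number_threshold, where the "+ 1" accounts for
    the selected user. Returns True when the IDs were recorded.
    """
    if len(associated) + 1 >= number_threshold:
        third_set.add(selected)
        third_set.update(associated)
        return True
    return False
```

With the example above, six associated identities plus the selected identity give 7, which meets the threshold of 5, so all seven identities are recorded.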
Here, an example similarity algorithm for card-keeping user identification is additionally provided for implementing steps S4012 to S4013 and steps S4022 to S4023: similarity is evaluated according to the account income billed in the month of network entry, the number of calls, the usage traffic and the number of short messages sent by newly networked users, and if the number of similar users in the same user set is greater than or equal to 5, all of these similar users (including the selected user) are determined to be suspected card-keeping users. A specific example computational flow may be as follows:
Step 1: indicator selection and normalization (corresponding to step S4011). For example, according to the service type, the account income, the number of calls, the usage traffic and the number of short messages of newly developed users are selected, and the average value and standard deviation of each are calculated. Then, the scores of the account income, the number of calls, the usage traffic and the number of short messages are calculated, respectively. For example, account income score = (account income - average account income) / standard deviation of account income.
Step 2: the user similarity between the selected user and the other users in the same user set (for example, users in the same channel, the same website or another range) is calculated (corresponding to steps S4012 to S4013). For example, for user a, the overall similarity between user a and user b may be calculated as the difference between 1 and the normalized similarity distance between the scores of the two users.
In this example, the higher the calculated overall similarity, the closer the communication behaviors of the users (such as income, calls, traffic and short messages) are to one another; such users may be card-keeping users whose communication behavior is simulated in batches by algorithms executed on hardware.
Step 3: behavior-similar users (i.e., associated user identifications) are determined (corresponding to steps S4021 to S4023). For example, if the similarity between two users is greater than or equal to a certain threshold (e.g., 0.97, in which case the distance between them is less than or equal to 0.03), the two users are determined to be similar. Then, if the total number of users in the same user set that are similar to the selected user, plus the selected user (i.e., all similar users for that behavioral characteristic data/score), is greater than or equal to a threshold number (e.g., 5), the selected user and the users with similar behavioral characteristics are determined to be suspected card-keeping users.
The similarity algorithm and the associated method of determining (screening) the third user set may vary depending on the behavioral characteristic data, or the combination of behavioral characteristic data, that is selected, and the invention is not limited in this regard.
As shown in fig. 4C, steps S401 and S402 may be implemented with an example method 400''. The method 400'' includes the steps of:
Steps S4014 to S4015 may implement step S401 shown in fig. 4A.
Step S4014: the IMEI of the user indicated by each user identification in the second set of users is extracted.
In this example, the behavioral characteristic data of the user indicated by the user identification is defined as the IMEI (International Mobile Equipment Identity), and the behavioral characteristic data may be searched with the user identification as an index to extract the corresponding set of IMEIs (e.g., from a server of the operator). Since the IMEI is identifier-type data, the similarity comparison between users can be made bit by bit on the IMEI; in one example, the comparison only determines whether the IMEIs of the users are the same or different.
Step S4015: for a selected user identity in the second set of users, user identities of other users having the same IMEI as the user indicated by the selected user identity are taken as associated user identities.
Here, multiple users having the same IMEI indicates that these users are active and communicating on the same hardware. When the number of users with the same IMEI (here, the selected user identity plus the other user identities having the same IMEI) is greater than or equal to a certain threshold (e.g., 5), this indicates that the "same hardware" is most likely a "cat pool" (i.e., a modem pool) rather than a "dual card dual standby" handset commonly used by ordinary users. This situation may be referred to as "IMEI set".
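The "IMEI set" check can be sketched as grouping user identifications by IMEI and keeping the groups that reach the threshold. The mapping layout, the function name `imei_set_users`, and the placeholder IMEI strings used for testing are assumptions made for illustration; real IMEIs would come from the operator's records.

```python
from collections import defaultdict

def imei_set_users(user_imeis, threshold=5):
    """Return the user IDs that share one IMEI with at least `threshold` users.

    `user_imeis` maps user_id -> IMEI string (hypothetical layout). The
    count includes every user on the device, i.e. the selected user plus
    the other users with the same IMEI.
    """
    by_imei = defaultdict(set)
    for user_id, imei in user_imeis.items():
        by_imei[imei].add(user_id)
    flagged = set()
    for users in by_imei.values():
        if len(users) >= threshold:
            flagged |= users
    return flagged
```

Grouping once by IMEI gives the same result as selecting each user identification in turn and counting its associated identifications, without repeating the comparison for every pair of users.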
Steps S4021 to S4023 may implement step S402 shown in fig. 4A in the same manner as in fig. 4B, and a description thereof is therefore omitted.
By determining that the number of users exhibiting a particular abnormal behavioral characteristic reaches a threshold (e.g., an IMEI set), the set of suspected abnormal users (e.g., card-keeping users) is further determined. Because the methods of figs. 4A-4C may be performed after the method of fig. 3 (and optionally the methods of figs. 2A-2B) as shown in fig. 1B, hierarchical and flexible identification of abnormal users (e.g., card-keeping users) may be implemented by performing layer-by-layer screening sequentially in the order of "preconditions", and "additional conditions" (where "preconditions" may be optional), which helps to reduce the amount of data processed, speed up the processing, and improve the accuracy of identification.
Fig. 5 is a schematic diagram showing behavior similarity, which is described in terms of usage traffic and number of calls and measured in terms of distance, according to an embodiment of the present invention.
As shown in fig. 5, the number of calls indicated by the horizontal axis increases in the positive direction, and the usage traffic indicated by the vertical axis increases in the positive direction. Each scattered point in the graph indicates a user, and the location of the point indicates that user's usage traffic and number of calls. Accordingly, the closer the distance between points, the higher the degree of similarity between the users indicated by the points. For example, the 7 users circled in the figure have high similarity (e.g., the similarity distance between one another is smaller than a certain threshold), and the number of users in the area (i.e., within the behavioral characteristic similarity threshold) is greater than a certain threshold (e.g., 5); these 7 users may therefore be determined to be suspected abnormal users (e.g., card-keeping users).
Alternatively, the similarity may be indicated in a space of three or more dimensions (when 3 or more pieces of behavioral characteristic data are used to indicate the similarity of behavioral characteristics), and the invention is not limited in this respect.
Fig. 6 is a block diagram illustrating an example apparatus 600 for abnormal user (e.g., card keeping user) identification in accordance with an embodiment of the present invention.
As shown in fig. 6, the example apparatus 600 may include a first screening unit 601, a second screening unit 602, and a third screening unit 603 for performing three rounds of screening on input data to achieve hierarchical screening, thereby improving identification efficiency and accuracy and reducing the amount of processed data.
Specifically, the first screening unit 601 may implement step S101 shown in fig. 1B or steps S201 to S203 shown in figs. 2A-2B with the operation data as input, so as to generate the first user set and provide it to the second screening unit 602. The second screening unit 602 may implement steps S102 to S103 shown in fig. 1B or steps S1031 to S1035 shown in fig. 3 with the first user set as input, so as to generate the second user set and provide it to the third screening unit 603. The third screening unit 603 may implement steps S104 to S105 shown in fig. 1B or steps S401 to S402 shown in figs. 4A to 4C with the second user set as input, so as to generate the third user set as the finally identified set of abnormal users (e.g., card-keeping users).
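The data flow between the three units can be sketched as a chain of callables, each of which receives the output of the previous one. The function name `identify_abnormal_users` and the representation of each unit as a plain Python callable returning a set are assumptions made for illustration; they do not reflect how the units are actually realized.

```python
from typing import Callable, Iterable, Set

def identify_abnormal_users(
    operation_data: Iterable,
    first_screen: Callable[[Iterable], Set],
    second_screen: Callable[[Set], Set],
    third_screen: Callable[[Set], Set],
) -> Set:
    """Run three screening rounds in order and return the third user set.

    Each callable stands in for one screening unit (hypothetical
    interface): the first consumes the operation data, and each later one
    consumes the set produced by the round before it.
    """
    first_set = first_screen(operation_data)
    second_set = second_screen(first_set)
    third_set = third_screen(second_set)
    return third_set
```

Because each round only sees the set left by the previous round, the more expensive pairwise similarity comparison of the third round runs on the smallest amount of data.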
Furthermore, according to embodiments of the present invention, the various methods and apparatus described above may be implemented using neural networks, with machine learning and training used to optimize the parameters. The invention is not limited to a particular type of neural network; a recurrent neural network may typically be used. For example, individual threshold parameters or parameter combinations of the neural network may be trained by minimizing one or more of: the difference between the actual number of abnormal users included in the first set of users and the total number of users of the first set of users; the difference between the actual number of abnormal users included in the second set of users and the total number of users of the second set of users; and the difference between the actual number of abnormal users included in the third set of users and the total number of users of the third set of users. That is, the neural network may be trained so that the set produced by each round of screening contains a greater number of actual abnormal users (i.e., so that the efficiency of each round of screening is increased). Alternatively, training may also be based on the screening speed (the time difference between input and output). Also, the various units of the example apparatus may each be implemented with a separate neural network, or may be implemented as a whole with a single neural network.
Fig. 7 is a block diagram illustrating a smart device 700 according to an embodiment of the present invention.
As shown in fig. 7, a smart device 700 may include a storage device 701 and a processor 702. The storage device 701 is used to store a computer program. The processor 702 runs the stored computer program to implement the various methods described above.
The storage device 701 may include a volatile memory, such as a random-access memory (RAM); the storage device may also include a non-volatile memory, such as a flash memory or a solid-state drive (SSD); the storage device may also include a combination of the above kinds of memory.
The processor 702 may be a central processing unit (CPU). The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a generic array logic (GAL) device, or the like.
Optionally, the storage device is further configured to store program instructions. The processor may invoke the program instructions to implement the method as shown in the embodiments of fig. 1B-4C of the present application.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided by the foregoing embodiment.
The steps in the methods of the embodiments of the application may be reordered, combined and deleted according to actual needs.
The modules in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
Those skilled in the art will appreciate that all or part of the processes of the above-described method embodiments may be accomplished by a computer program stored on a computer-readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure describes only a few examples of the present invention and is not intended to limit the scope of the present invention. Those skilled in the art will understand that implementations of all or part of the processes of the above embodiments, and equivalent modifications made thereto, still fall within the scope of the present invention.