CN111506615A

CN111506615A - Method and device for determining occupation degree of invalid user

Info

Publication number: CN111506615A
Application number: CN202010321445.5A
Authority: CN
Inventors: 高文辉
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-04-22
Filing date: 2020-04-22
Publication date: 2020-08-07

Abstract

The invention discloses a method and a device for determining the occupancy degree of invalid users. The method comprises: acquiring user data of a reference user group and user data of a user group to be tested; for each data feature dimension in a plurality of data feature dimensions, determining a similarity index of the data distributions of the user data of the reference user group and the user data of the user group to be tested in that data feature dimension; and determining the occupancy degree of invalid users of the user group to be tested according to the similarity index of each data feature dimension in the plurality of data feature dimensions. When the method is applied to financial technology (Fintech), even if a user is not in an invalid-user list, the occupancy degree of invalid users of the user group to be tested can be determined by comparing its similarity with the user data of an invalid user group or a valid user group, so the occupancy degree of invalid users can be determined more accurately.

Description

Method and device for determining occupation degree of invalid user

Technical Field

The invention relates to the field of information security in the field of financial technology (Fintech), in particular to a method and a device for determining the occupation degree of an invalid user.

Background

With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). Because the financial industry demands security and real-time performance, it also places higher requirements on these technologies. For a financial institution, promoting its financial tools, such as financial mobile applications (Apps), is crucial, and the number of valid registered users is an important index for evaluating how widely a financial tool has been adopted.

However, when financial tools are currently promoted, there is a risk that the registered users are flooded with a large number of invalid users, such as bonus hunters (so-called "wool-pulling" users), who consume a large amount of promotion resources. Discovering this risk in time is of great significance to the promotion of financial tools. In the current method, the registered users of the financial tool are matched against an invalid-user list to determine the occupancy degree of invalid users, and if the occupancy degree exceeds a preset proportion, it is determined that a large number of invalid users exist. However, the coverage of the invalid-user list is limited, so this method has difficulty determining the occupancy degree of invalid users accurately.

Disclosure of Invention

The invention provides a method and a device for determining the occupation degree of an invalid user, which solve the problem that the occupation degree of the invalid user is difficult to accurately determine in the prior art.

In a first aspect, the present invention provides a method for determining occupancy of an invalid user, including: acquiring user data of a reference user group and user data of a user group to be detected; the reference user group is an invalid user group or an effective user group; the user data of the reference user group and the user data of the user group to be detected both comprise data of a plurality of data characteristic dimensions; determining similarity indexes of data distribution of the user data of the reference user group and the user data of the user group to be detected in the data characteristic dimension aiming at each data characteristic dimension in a plurality of data characteristic dimensions; and determining the occupation degree of invalid users of the user group to be tested according to the similarity index of each data characteristic dimension in the plurality of data characteristic dimensions.

According to the method, after the user data of the reference user group and the user data of the user group to be detected are obtained, the similarity of data distribution between the user data of the reference user group and the user data of the user group to be detected under the data characteristic dimension can be determined by comparing the similarity index of each data characteristic dimension in a plurality of data characteristic dimensions, so that even if the user is not in an invalid user list, the occupation degree of the invalid user of the user group to be detected can be determined by comparing the similarity of the user data of the invalid user group or the user data of the valid user group, and the occupation degree of the invalid user can be determined more accurately.

Optionally, the determining, according to the similarity index of each data feature dimension in the plurality of data feature dimensions, the occupancy degree of the invalid user of the user group to be tested includes: determining an occupation index of the occupation degree of invalid users of the user group to be tested under the data characteristic dimension according to the similarity index of each data characteristic dimension in the data characteristic dimensions; the occupation index is used for representing the occupation degree of invalid users of the user group to be detected under the data characteristic dimension; determining the occupation index of the occupation degree of the invalid users of the user group to be detected under the plurality of data characteristic dimensions according to the occupation index of the occupation degree of the invalid users of the user group to be detected under each data characteristic dimension in the plurality of data characteristic dimensions and the weight value of the data characteristic dimension; and determining the occupation degree of the invalid users of the user group to be tested according to the occupation indexes of the occupation degrees of the invalid users of the user group to be tested under the multiple data characteristic dimensions.

In the above method, an occupancy index is first determined for each data feature dimension from that dimension's similarity index, which characterizes the occupancy degree of invalid users of the user group to be tested in that single dimension. The per-dimension occupancy indexes are then combined, using the weight value of each dimension, into an occupancy index across the plurality of data feature dimensions. The occupancy degree of invalid users of the user group to be tested is therefore determined accurately by jointly considering the plurality of data feature dimensions and their weight values.

Optionally, the weight value of each of the plurality of data feature dimensions is inversely related to the degree of dispersion of the user data of the reference user group in the data feature dimension.

In the above method, for the user data of the reference user group and the user data of the user group to be tested, the smaller the degree of dispersion in a data feature dimension, the smaller the differences among the data, so the similarity index in that dimension is more sensitive to differences in the data. Setting the weight value to be negatively correlated with the degree of dispersion therefore lets the occupancy degree of invalid users be reflected more clearly.

Optionally, the weight value of each data feature dimension in the plurality of data feature dimensions is specifically calculated according to the following method: determining, for each of the plurality of data feature dimensions, a standard deviation of the user data of the reference user group in the data feature dimension, thereby characterizing a degree of dispersion of the user data of the reference user group in the data feature dimension; determining the weight value of the data characteristic dimension according to the reciprocal of the standard deviation of the user data of the reference user group under the plurality of data characteristic dimensions; the weighted value of the data characteristic dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group in the data characteristic dimension.

In the above method, the dispersion degree of the user data of the reference user group in the data feature dimension is represented by the standard deviation of the user data of the reference user group in the data feature dimension, and the weight value of the data feature dimension is determined according to the inverse of the standard deviation of the user data of the reference user group in the data feature dimensions, so that a method for determining the weight value of each data feature dimension in the data feature dimensions according to the standard deviation and the inverse of the standard deviation is provided.

Optionally, the occupancy level is high occupancy or low occupancy; and the occupation index of the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is the probability that the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is high or low.

In the mode, the occupation degree is represented by the probability of high occupation or the probability of low occupation, so that the occupation degree of invalid users of the user group to be tested is more intuitively represented.

Optionally, determining, for each data feature dimension in the plurality of data feature dimensions, the similarity index of the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension includes: determining, for each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension; determining, according to the user data of the reference user group and the user data of the user group to be tested, the number of users of the reference user group and of the user group to be tested in the subgroup of each category; and determining the similarity index of the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension according to those numbers of users.

In the above manner, each data characteristic dimension is subdivided into subgroups of multiple categories, and the number of users in each subgroup of the reference user group and the user group to be tested is determined, so that the similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested in the data characteristic dimension is determined by considering the subgroups of the multiple categories.

Optionally, the similarity index of the data distribution in the data feature dimension is the Pearson correlation coefficient or the cosine similarity, in that data feature dimension, of the user data of the reference user group and the user data of the user group to be tested.

In the above manner, the Pearson correlation coefficient or the cosine similarity characterizes the similarity between the user data of the reference user group and the user data of the user group to be tested.

Optionally, the plurality of data feature dimensions include at least one of: the age group of the user; the home location of the user's ID card number; the home location of the user's mobile phone number; the home location of the user's bank card number; the home location of the user's Internet Protocol (IP) address.

Since invalid users frequently show up as a concentration in certain age groups or in the home locations of various numbers, focusing on the above data feature dimensions allows the occupancy degree of invalid users to be determined more accurately.

In a second aspect, the present invention provides a device for determining the occupancy degree of invalid users, comprising: an acquisition module, configured to acquire the user data of a reference user group and the user data of a user group to be tested, wherein the reference user group is an invalid user group or a valid user group, and the user data of the reference user group and the user data of the user group to be tested both comprise data of a plurality of data feature dimensions; and a processing module, configured to determine, for each data feature dimension in the plurality of data feature dimensions, a similarity index of the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension, and to determine the occupancy degree of invalid users of the user group to be tested according to the similarity index of each data feature dimension in the plurality of data feature dimensions.

Optionally, the processing module is specifically configured to: determining an occupation index of the occupation degree of invalid users of the user group to be tested under the data characteristic dimension according to the similarity index of each data characteristic dimension in the data characteristic dimensions; the occupation index is used for representing the occupation degree of invalid users of the user group to be detected under the data characteristic dimension; determining the occupation index of the occupation degree of the invalid users of the user group to be detected under the plurality of data characteristic dimensions according to the occupation index of the occupation degree of the invalid users of the user group to be detected under each data characteristic dimension in the plurality of data characteristic dimensions and the weight value of the data characteristic dimension; and determining the occupation degree of the invalid users of the user group to be tested according to the occupation indexes of the occupation degrees of the invalid users of the user group to be tested under the multiple data characteristic dimensions.

Optionally, the processing module is specifically configured to: determine, for each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension; determine, according to the user data of the reference user group and the user data of the user group to be tested, the number of users of the reference user group and of the user group to be tested in the subgroup of each category; and determine the similarity index of the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension according to those numbers of users.

The advantageous effects of the second aspect and the various optional apparatuses of the second aspect may refer to the advantageous effects of the first aspect and the various optional methods of the first aspect, and are not described herein again.

In a third aspect, the present invention provides a computer device comprising a program or instructions for performing the method of the first aspect and the alternatives of the first aspect when the program or instructions are executed.

In a fourth aspect, the present invention provides a storage medium comprising a program or instructions which, when executed, is adapted to perform the method of the first aspect and the alternatives of the first aspect.

Drawings

Fig. 1 is a schematic flowchart illustrating steps of a method for determining occupancy of an invalid user according to an embodiment of the present disclosure;

Fig. 2 is a schematic structural diagram of a device for determining the occupancy degree of invalid users according to an embodiment of the present application.

Detailed Description

For a better understanding of the technical solutions, they are described in detail below with reference to the drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application rather than limitations on them, and that the technical features in the embodiments and examples may be combined with each other where no conflict arises.

The terms appearing in the embodiments of the present application are first listed below.

Valid user: a user who registers for the App for normal use.

Invalid user: a user who registers for the App for abnormal purposes. For example, a bonus hunter (a so-called "wool-pulling" user) registers for the App mainly to obtain the reward of a promotional activity and rarely logs in to the App afterwards.

Real-name registration: refers to the user information submitted when registering for the App.

Pearson correlation test: tests whether two groups of data X and Y, each with sample size n, are significantly correlated by calculating their correlation coefficient. The coefficient lies in [-1, 1]: the closer it is to 1, the stronger the positive linear correlation between the two groups; the closer it is to -1, the stronger the negative linear correlation; and the closer it is to 0, the weaker the linear correlation. The correlation coefficient is defined as follows:

$$\rho_{X,Y} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

where $\bar{X}$ and $\bar{Y}$ are the means of the two groups of data.

When a financial institution (a banking, insurance, or securities institution) conducts business (such as the loan or deposit business of a bank), promoting the corresponding financial tool App to valid users is of great significance to the institution. However, a large number of invalid users may register during the promotion of a financial tool, and these invalid users waste promotion resources, so how to determine the occupancy degree of invalid users is a problem of great concern. In the conventional method, the occupancy degree of invalid users is determined by matching the registered users of the financial tool against an invalid-user list, but the coverage of the list is limited, so the occupancy degree of invalid users is difficult to determine accurately. This does not meet the requirements of financial institutions such as banks and cannot ensure the efficient operation of their services.

For this reason, as shown in fig. 1, the embodiment of the present application provides a method for determining the occupancy level of an invalid user.

Step 101: acquire the user data of the reference user group and the user data of the user group to be tested.

Step 102: for each data feature dimension in a plurality of data feature dimensions, determine a similarity index of the data distributions of the user data of the reference user group and the user data of the user group to be tested in that data feature dimension.

Step 103: determine the occupancy degree of invalid users of the user group to be tested according to the similarity index of each data feature dimension in the plurality of data feature dimensions.

In steps 101 to 103, a user may be a registered user, and the reference user group is an invalid user group or a valid user group. An invalid user group may be a user group containing a high proportion of invalid users (e.g., exceeding an invalid-user ratio threshold), and a valid user group may be a user group containing a high proportion of valid users (e.g., exceeding a valid-user ratio threshold).

Specifically, whether a user group is an invalid user group or a valid user group may be determined according to the occurrence of invalid users in that group within a preset historical time range. The reference user group (e.g., labeled user_group0) may be an invalid user group or a valid user group, and the user group suspected of containing a large number of invalid users serves as the user group to be tested (e.g., labeled user_group1). For example, all users registered on day n are taken as user_group1, and all users registered on day n-7 are taken as user_group0.
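The day-based grouping in this example can be sketched as follows. This is only a minimal illustration: the record layout (pairs of user id and registration date), the function name `split_groups`, and the sample dates are assumptions made here, not details given in the source.

```python
from datetime import date, timedelta


def split_groups(records, day_n, lag_days=7):
    """Return (user_group0, user_group1) as lists of user ids.

    user_group1: users registered on day n (the group to be tested).
    user_group0: users registered on day n - lag_days (the reference group).
    """
    ref_day = day_n - timedelta(days=lag_days)
    user_group0 = [uid for uid, reg_day in records if reg_day == ref_day]
    user_group1 = [uid for uid, reg_day in records if reg_day == day_n]
    return user_group0, user_group1


# Hypothetical registration records: (user id, registration date).
records = [
    ("u1", date(2020, 4, 15)),
    ("u2", date(2020, 4, 15)),
    ("u3", date(2020, 4, 22)),
    ("u4", date(2020, 4, 20)),
]
user_group0, user_group1 = split_groups(records, date(2020, 4, 22))
```

Users registered on any other day (such as `u4` here) fall into neither group.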

Both the user data of the reference user group and the user data of the user group to be tested include data of a plurality of data feature dimensions. A data feature dimension refers to one aspect of the attributes that describe a set of user data, for example the age of the user or the online duration of the user. In each data feature dimension, the similarity index characterizes how similar the user data of the reference user group is to the user data of the user group to be tested. When the reference user group is an invalid user group, the more similar the two are, the higher the occupancy degree of invalid users in the user group to be tested (equivalently, the lower the occupancy degree of valid users). The case in which the reference user group is a valid user group is analogous: the more similar the two are, the lower the occupancy degree of invalid users. In addition, the occupancy degree may take specific values, for example high occupancy or low occupancy; more values (for example, multiple occupancy levels) may be defined to represent the occupancy degree of valid or invalid users, and a mapping relationship between the value of the similarity index and the value of the occupancy degree may be established.
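One way such a mapping between the similarity index and occupancy levels might look is sketched below. The three level names, the cut-off values, and the function name `occupancy_level` are all hypothetical; the source only states that a mapping may be established. The sketch assumes a similarity index in [-1, 1], such as a Pearson correlation coefficient.

```python
def occupancy_level(similarity, reference_is_invalid, cuts=(-0.5, 0.5)):
    """Map a similarity index in [-1, 1] to a coarse occupancy level.

    The cut-offs are illustrative assumptions, not values from the source.
    """
    low_cut, high_cut = cuts
    # Orient the score so that a larger value always means
    # "looks more like a group dominated by invalid users".
    score = similarity if reference_is_invalid else -similarity
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"


level_vs_valid_ref = occupancy_level(0.9, reference_is_invalid=False)
level_vs_invalid_ref = occupancy_level(0.9, reference_is_invalid=True)
```

The same similarity value leads to opposite conclusions depending on whether the reference user group is a valid or an invalid user group, which is why the orientation step comes first.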

It should be noted that invalid users generally exhibit clustering in certain data feature dimensions at registration: for example, the registered users are relatively old, the ID card numbers used for registration are concentrated in certain places, and the IP addresses reported at registration are concentrated in certain places.

Thus, specific data feature dimensions may be selected as the plurality of data feature dimensions, which in an alternative embodiment may include at least one of: the age group of the user; the home location of the user's ID card number; the home location of the user's mobile phone number; the home location of the user's bank card number; the home location of the user's Internet Protocol (IP) address.

An alternative implementation of step 102 is as follows:

For each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension are determined. According to the user data of the reference user group and the user data of the user group to be tested, the number of users of the reference user group and of the user group to be tested in the subgroup of each category is determined. The similarity index of the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension is then determined according to those numbers of users.

Specifically, taking the plurality of data feature dimensions selected above as an example, the number of users in each subgroup may be counted for each dimension as follows:

a) Counting the number of users in the subgroups of the categories of the user's age group: the ages of the users at real-name registration are grouped by age group, such as 20 and below, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91 and above. The number of users in the subgroup of each age group is counted and recorded as age_cnt.

b) Counting the number of users in the subgroups of the categories of the home location of the user's ID card number: the first 6 digits of the ID card number are extracted to obtain the city to which the ID card number belongs. The number of users in the subgroup of each home location is counted and recorded as idno_attr_cnt.

c) Counting the number of users in the subgroups of the categories of the home location of the user's mobile phone number: the first k (e.g., 7) digits of the mobile phone number are extracted to obtain the home city of the mobile phone number. If the number is a virtual-operator number, for example one beginning with 147, 170, or 171, its home location is assigned to a dedicated virtual-operator subgroup (which is also treated as a city subgroup). The number of users in the subgroup of each home location is counted and recorded as phone_attr_cnt.

d) Counting the number of users in the subgroups of the categories of the home location of the user's bank card number: the bank card number is looked up to obtain its home location. The number of users in the subgroup of each home location is counted and recorded as card_attr_cnt.

e) Counting the number of users in the subgroups of the categories of the home location of the user's Internet Protocol (IP) address: the IP address is looked up to obtain its home location. The number of users in the subgroup of each home location is counted and recorded as ip_attr_cnt.
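The counting in items a) and c) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and category labels are hypothetical, and the raw number prefix stands in for the home city, whereas a real system would presumably look the prefix up in a home-location table that the source does not describe.

```python
from collections import Counter

# Prefixes named in the text as examples of virtual-operator numbers.
VIRTUAL_OPERATOR_PREFIXES = ("147", "170", "171")


def age_group(age):
    """Bucket an age into the age groups listed in item a)."""
    if age <= 20:
        return "<=20"
    if age >= 91:
        return ">=91"
    low = ((age - 1) // 10) * 10 + 1  # 21, 31, ..., 81
    return f"{low}-{low + 9}"


def phone_category(phone, k=7):
    """Return the subgroup key for a mobile phone number, as in item c)."""
    if phone.startswith(VIRTUAL_OPERATOR_PREFIXES):
        return "virtual_operator"
    # Assumption: the first k digits stand in for the home city.
    return phone[:k]


# Made-up sample data.
ages = [19, 25, 30, 67, 95]
phones = ["13812345678", "17012345678", "13812349999"]

age_cnt = Counter(age_group(a) for a in ages)
phone_attr_cnt = Counter(phone_category(p) for p in phones)
```

The counts for the ID card, bank card, and IP address dimensions would follow the same pattern with their own category functions.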

For the same data feature dimension, the subgroups of the user data of the user group to be tested and of the reference user group are aligned: if one user group has a certain subgroup but the other user group has no corresponding subgroup, the count of the latter user group for that subgroup is filled with 0, as with the two zeros in Table 1:

TABLE 1
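The zero-filling described above can be sketched as follows. The function name `align_counts` and the sample counts are illustrative assumptions; in particular, the counts are not the values of Table 1.

```python
def align_counts(cnt0, cnt1):
    """Align two per-subgroup count mappings over the union of subgroups.

    A subgroup that is missing from one group gets a count of 0 there.
    """
    categories = sorted(set(cnt0) | set(cnt1))
    v0 = [cnt0.get(c, 0) for c in categories]
    v1 = [cnt1.get(c, 0) for c in categories]
    return categories, v0, v1


# Made-up counts per home-location subgroup.
attr_cnt0 = {"city_A": 5, "city_B": 3}   # reference user group
attr_cnt1 = {"city_A": 4, "city_C": 6}   # user group to be tested
categories, v0, v1 = align_counts(attr_cnt0, attr_cnt1)
```

After alignment the two count vectors have the same length and the same subgroup order, which is what a per-dimension similarity index needs.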

It should be noted that, in an optional implementation of steps 101 to 103, the similarity index of the data distribution in a data feature dimension may be the Pearson correlation coefficient or the cosine similarity, in that data feature dimension, of the user data of the reference user group and the user data of the user group to be tested. Taking the Pearson correlation coefficient as the similarity index and a valid user group as the reference user group, the similarity index $\rho_{age}$ obtained in step 102 for the data feature dimension of the age group of the user is:

$$\rho_{age} = \frac{\sum_{i=1}^{n}\left(\mathrm{age\_cnt0}_i-\overline{\mathrm{age\_cnt0}}\right)\left(\mathrm{age\_cnt1}_i-\overline{\mathrm{age\_cnt1}}\right)}{\sqrt{\sum_{i=1}^{n}\left(\mathrm{age\_cnt0}_i-\overline{\mathrm{age\_cnt0}}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\mathrm{age\_cnt1}_i-\overline{\mathrm{age\_cnt1}}\right)^2}}$$

where $\mathrm{age\_cnt0}_i$ is the number of users in the subgroup of the i-th age group of the reference user group, $\overline{\mathrm{age\_cnt0}}$ is the average number of users over the subgroups of the n age groups of the reference user group, $\mathrm{age\_cnt1}_i$ is the number of users in the subgroup of the i-th age group of the user group to be tested, and $\overline{\mathrm{age\_cnt1}}$ is the average number of users over the subgroups of the n age groups of the user group to be tested.
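The two similarity indexes named above can be computed over aligned subgroup counts as in the following sketch. It uses only the standard library; the function names and the sample counts are assumptions for illustration, and the Pearson function follows the standard definition of the coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("coefficient is undefined for a constant vector")
    return cov / (dev_x * dev_y)


def cosine(x, y):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("similarity is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)


# Made-up counts per age-group subgroup; the second group has the same
# shape of distribution at twice the volume.
age_cnt0 = [10, 40, 30, 15, 5]
age_cnt1 = [20, 80, 60, 30, 10]
rho_age = pearson(age_cnt0, age_cnt1)
cos_age = cosine(age_cnt0, age_cnt1)
```

Both measures compare the shape of the two distributions rather than their absolute sizes, so groups with different total numbers of registered users remain comparable.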

An alternative implementation of step 103 is as follows:

Step (103-1): determine, according to the similarity index of each data feature dimension in the plurality of data feature dimensions, an occupancy index of the occupancy degree of invalid users of the user group to be tested in that data feature dimension.

Step (103-2): determine the occupancy index of the occupancy degree of invalid users of the user group to be tested across the plurality of data feature dimensions according to the occupancy index in each data feature dimension and the weight value of that data feature dimension.

Step (103-3): determine the occupancy degree of invalid users of the user group to be tested according to the occupancy index across the plurality of data feature dimensions.

In the step (103-1), the occupancy index is used for representing the occupancy degree of invalid users of the user group to be tested under the data characteristic dimension. In an alternative embodiment, the degree of occupancy is high occupancy or low occupancy; and the occupation index of the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is the probability that the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is high or low.

In the optional embodiment in which the occupancy index is a probability, and continuing the example in which $\rho_{age}$ is the similarity index, step (103-1) obtains the probability $P_{age}$ that the occupancy degree of invalid users of the user group to be tested is high occupancy in the data feature dimension of the age group of the user, where $P_{age}$ takes values in [0, 1]. Specifically, according to the definition of the Pearson correlation coefficient, in a given data feature dimension (such as the age group of the user), the stronger the positive linear correlation between the user data of the reference user group and the user data of the user group to be tested, the more similar the distributions of the user data of the reference user group (here a valid user group) and of the user group to be tested, and the smaller $P_{age}$ is; that is, the lower the possibility that the occupancy degree of invalid users of the user group to be tested is high occupancy in the age-group dimension. The probabilities for the other data feature dimensions are obtained similarly.
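A mapping with the properties described here (values in [0, 1], decreasing as the correlation with a valid reference group increases) could look like the sketch below. The linear form (1 - ρ) / 2 is an assumption chosen for illustration because it satisfies those two properties; it is not claimed to be the formula used in the source. The function name and the `reference_is_valid` switch are likewise hypothetical.

```python
def high_occupancy_probability(rho, reference_is_valid=True):
    """Map a Pearson coefficient in [-1, 1] to a probability in [0, 1].

    Assumed linear mapping, for illustration only:
      valid reference group:   P = (1 - rho) / 2  (more similar -> smaller P)
      invalid reference group: P = (1 + rho) / 2  (more similar -> larger P)
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    if reference_is_valid:
        return (1.0 - rho) / 2.0
    return (1.0 + rho) / 2.0


p_similar = high_occupancy_probability(1.0)      # identical distribution shape
p_opposite = high_occupancy_probability(-1.0)    # opposite distribution shape
p_unrelated = high_occupancy_probability(0.0)    # no linear correlation
```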

Further, based on the above example of step (103-1), the probabilities that the occupancy degree of invalid users of the user group to be tested is high occupancy can be obtained in turn for the age group of the user, the home location of the user's ID card number, the home location of the user's mobile phone number, the home location of the user's bank card number, and the home location of the user's Internet Protocol (IP) address: $P_{age}$, $P_{idno}$, $P_{phone}$, $P_{card}$, and $P_{ip}$. With the weight value of each data feature dimension ($w_{age}$, $w_{idno}$, $w_{phone}$, $w_{card}$, and $w_{ip}$, respectively), step (103-2) is specifically executed as:

$$S = w_{age}P_{age} + w_{idno}P_{idno} + w_{phone}P_{phone} + w_{card}P_{card} + w_{ip}P_{ip}$$

where S is the occupancy index of the occupancy degree of invalid users of the user group to be tested across the plurality of data feature dimensions.
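The weighted combination of step (103-2), followed by a simple decision for step (103-3), can be sketched as follows. The probability and weight values are made up, and the 0.5 cut-off used to turn S into "high" or "low" is a hypothetical choice; the source does not state how S is converted into an occupancy degree.

```python
def combined_occupancy_index(probabilities, weights):
    """Weighted sum S of per-dimension high-occupancy probabilities."""
    if set(probabilities) != set(weights):
        raise ValueError("probabilities and weights must cover the same dimensions")
    return sum(weights[dim] * probabilities[dim] for dim in probabilities)


# Illustrative values only; they are not taken from the source.
P = {"age": 0.1, "idno": 0.8, "phone": 0.6, "card": 0.2, "ip": 0.9}
w = {"age": 0.2, "idno": 0.2, "phone": 0.2, "card": 0.2, "ip": 0.2}
S = combined_occupancy_index(P, w)

# Step (103-3) with an assumed cut-off of 0.5.
occupancy = "high" if S >= 0.5 else "low"
```

If the weights sum to 1 and each probability lies in [0, 1], S also lies in [0, 1] and can be read on the same scale as the per-dimension probabilities.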

In an alternative embodiment, the weight value of each data characteristic dimension may be set as follows: the weight value of each data characteristic dimension in the plurality of data characteristic dimensions is negatively correlated with the degree of dispersion of the user data of the reference user group in that data characteristic dimension.

The smaller the degree of dispersion of the user data of the reference user group and of the user group to be tested in a data characteristic dimension, the smaller the differences among the data, and the more sensitive the similarity index of that data characteristic dimension is to differences in the data. Setting the weight value to be negatively correlated with the degree of dispersion therefore allows the occupation degree of invalid users to be reflected more clearly.

It should be noted that, the weight value of each data feature dimension in the multiple data feature dimensions is specifically calculated as follows:

determining, for each of the plurality of data feature dimensions, a standard deviation of the user data of the reference user group in the data feature dimension, thereby characterizing a degree of dispersion of the user data of the reference user group in the data feature dimension; determining the weight value of the data characteristic dimension according to the reciprocal of the standard deviation of the user data of the reference user group under the plurality of data characteristic dimensions; the weighted value of the data characteristic dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group in the data characteristic dimension.

Taking the user's age group as the example data characteristic dimension, the weight value $w_{age}$ corresponding to the user's age group may be determined as follows:

where $\sigma_{age}$, $\sigma_{idno}$, $\sigma_{phone}$, $\sigma_{card}$ and $\sigma_{ip}$ are, in order, the standard deviations of the user data of the reference user group in the following data characteristic dimensions: the user's age group; the home location of the user's ID card number; the home location of the user's mobile phone number; the home location of the user's bank card number; and the home location of the user's Internet Protocol (IP) address.

Thus $w_{age} + w_{idno} + w_{phone} + w_{card} + w_{ip} = 1$, and $S$ takes values in the range [0, 1]. A preset alarm threshold corresponding to $S$ can also be set: when $S$ is greater than the preset alarm threshold, an alarm is raised to indicate that the user group to be tested contains a high proportion of invalid users. In addition, the closer $S$ is to 1, the more serious the invalid user registration.
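One way to obtain weights that are positively correlated with the reciprocal of the standard deviation and that sum to 1 is to normalise the reciprocals by their total. The sketch below makes that assumption explicitly; it also assumes the population standard deviation and uses only two dimensions for brevity. The function names, sample data, probabilities and the threshold of 0.7 are all hypothetical.

```python
import statistics


def reciprocal_std_weights(reference_data):
    """Weights proportional to 1/sigma per dimension, normalised to sum to 1."""
    inverse = {}
    for dimension, values in reference_data.items():
        sigma = statistics.pstdev(values)  # population standard deviation (assumed)
        if sigma == 0:
            raise ValueError(f"zero standard deviation in dimension {dimension!r}")
        inverse[dimension] = 1.0 / sigma
    total = sum(inverse.values())
    return {dimension: value / total for dimension, value in inverse.items()}


def should_alarm(s, threshold):
    """True when the overall index S exceeds the preset alarm threshold."""
    return s > threshold


# Hypothetical reference-group data; "ip" is twice as dispersed as "age".
reference_data = {"age": [10, 20, 30], "ip": [10, 30, 50]}
weights = reciprocal_std_weights(reference_data)

# Hypothetical per-dimension probabilities of high invalid-user occupancy.
probabilities = {"age": 0.9, "ip": 0.6}
s = sum(weights[dim] * probabilities[dim] for dim in weights)
alarm = should_alarm(s, threshold=0.7)
```

Because the less dispersed dimension receives the larger weight (2/3 against 1/3 here), its probability dominates S, which works out to 0.8 and exceeds the sample threshold.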

As shown in fig. 2, the present invention provides an occupancy level determination device for an invalid user, comprising: an obtaining module 201, configured to obtain user data of a reference user group and user data of a user group to be detected, where the reference user group is an invalid user group or an effective user group, and the user data of the reference user group and the user data of the user group to be detected both comprise data of a plurality of data characteristic dimensions; and a processing module 202, configured to determine, for each data feature dimension of the plurality of data feature dimensions, a similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension, and further configured to determine the occupancy degree of the invalid users of the user group to be tested according to the similarity index of each data characteristic dimension in the plurality of data characteristic dimensions.

Optionally, the processing module 202 is specifically configured to: determining an occupation index of the occupation degree of invalid users of the user group to be tested under the data characteristic dimension according to the similarity index of each data characteristic dimension in the data characteristic dimensions; the occupation index is used for representing the occupation degree of invalid users of the user group to be detected under the data characteristic dimension; determining the occupation index of the occupation degree of the invalid users of the user group to be detected under the plurality of data characteristic dimensions according to the occupation index of the occupation degree of the invalid users of the user group to be detected under each data characteristic dimension in the plurality of data characteristic dimensions and the weight value of the data characteristic dimension; and determining the occupation degree of the invalid users of the user group to be tested according to the occupation indexes of the occupation degrees of the invalid users of the user group to be tested under the multiple data characteristic dimensions.

Optionally, the processing module 202 is specifically configured to: determine, for each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension; determine, according to the user data of the reference user group and the user data of the user group to be detected, the number of users in the subgroup of each of the plurality of categories for the reference user group and for the user group to be detected; and determine the similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be detected under the data characteristic dimension according to the number of users in the subgroup of each of the plurality of categories for the reference user group and for the user group to be detected.
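The three steps above can be sketched for the age dimension as follows. This is an illustrative sketch only: the age brackets, the helper names and the sample ages are hypothetical, and cosine similarity is used because the claims name it as one option for the similarity index alongside the Pearson correlation coefficient.

```python
import math
from collections import Counter

# Hypothetical categories for the age-group dimension.
AGE_BANDS = ["<20", "20-29", "30-39", "40+"]


def age_band(age):
    """Map an age to its (hypothetical) age bracket."""
    if age < 20:
        return "<20"
    if age < 30:
        return "20-29"
    if age < 40:
        return "30-39"
    return "40+"


def subgroup_counts(ages):
    """Number of users in the subgroup of each category, in AGE_BANDS order."""
    counts = Counter(age_band(age) for age in ages)
    return [counts[band] for band in AGE_BANDS]


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)


reference_ages = [18, 22, 25, 31, 35, 38, 44]  # hypothetical reference group
tested_ages = [19, 21, 23, 27, 33, 52]         # hypothetical group to be tested

reference_counts = subgroup_counts(reference_ages)
tested_counts = subgroup_counts(tested_ages)
similarity_age = cosine_similarity(reference_counts, tested_counts)
```

Counting per category first means the two groups may differ in size; only the shape of the two count vectors affects the similarity index.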

Embodiments of the present application provide a computer device, which includes a program or instructions, and when the program or instructions are executed, the program or instructions are used to execute an occupancy level determination method for invalid users and any optional method provided by embodiments of the present application.

Embodiments of the present application provide a storage medium, which includes a program or an instruction, and when the program or the instruction is executed, the program or the instruction is used to execute a method for determining occupancy of an invalid user and any optional method provided by embodiments of the present application.

Finally, it should be noted that: as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for determining occupancy of an invalid user, comprising:

acquiring user data of a reference user group and user data of a user group to be detected; the reference user group is an invalid user group or an effective user group; the user data of the reference user group and the user data of the user group to be detected both comprise data of a plurality of data characteristic dimensions;

determining similarity indexes of data distribution of the user data of the reference user group and the user data of the user group to be detected in the data characteristic dimension aiming at each data characteristic dimension in a plurality of data characteristic dimensions;

and determining the occupation degree of invalid users of the user group to be tested according to the similarity index of each data characteristic dimension in the plurality of data characteristic dimensions.

2. The method of claim 1, wherein determining the degree of occupancy of invalid users of the group of users to be tested based on the similarity measure for each of the plurality of data feature dimensions comprises:

determining an occupation index of the occupation degree of invalid users of the user group to be tested under the data characteristic dimension according to the similarity index of each data characteristic dimension in the data characteristic dimensions; the occupation index is used for representing the occupation degree of invalid users of the user group to be detected under the data characteristic dimension;

determining the occupation index of the occupation degree of the invalid users of the user group to be detected under the plurality of data characteristic dimensions according to the occupation index of the occupation degree of the invalid users of the user group to be detected under each data characteristic dimension in the plurality of data characteristic dimensions and the weight value of the data characteristic dimension;

and determining the occupation degree of the invalid users of the user group to be tested according to the occupation indexes of the occupation degrees of the invalid users of the user group to be tested under the multiple data characteristic dimensions.

3. The method of claim 2, wherein the weight value for each of the plurality of data feature dimensions is inversely related to the degree of dispersion of the user data of the reference user group in the data feature dimension.

4. The method of claim 3, wherein the weight value for each of the plurality of data feature dimensions is calculated in particular as follows:

determining, for each of the plurality of data feature dimensions, a standard deviation of the user data of the reference user group in the data feature dimension, thereby characterizing a degree of dispersion of the user data of the reference user group in the data feature dimension;

determining the weight value of the data characteristic dimension according to the reciprocal of the standard deviation of the user data of the reference user group under the plurality of data characteristic dimensions; the weighted value of the data characteristic dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group in the data characteristic dimension.

5. The method of claim 2, wherein the occupancy level is high occupancy or low occupancy; and the occupation index of the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is the probability that the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is high or low.

6. The method of claim 1, wherein determining, for each of the plurality of data feature dimensions, a similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be detected in the data feature dimension comprises:

determining, for each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension;

determining, according to the user data of the reference user group and the user data of the user group to be detected, the number of users in the subgroup of each of the plurality of categories for the reference user group and for the user group to be detected;

and determining the similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be detected under the data characteristic dimension according to the number of users in the subgroup of each of the plurality of categories for the reference user group and for the user group to be detected.

7. The method of any one of claims 1 to 6, wherein the similarity index of the data distribution in the data feature dimension is the Pearson correlation coefficient or the cosine similarity, in the data feature dimension, of the user data of the reference user group and the user data of the user group to be detected.

8. The method of any of claims 1 to 6, wherein the plurality of data characteristic dimensions comprise at least one of: the user's age group; the home location of the user's ID card number; the home location of the user's mobile phone number; the home location of the user's bank card number; the home location of the user's Internet Protocol (IP) address.

9. An occupancy level determination device for an invalid user, comprising:

the acquisition module is used for acquiring the user data of the reference user group and the user data of the user group to be detected; the reference user group is an invalid user group or an effective user group; the user data of the reference user group and the user data of the user group to be detected both comprise data of a plurality of data characteristic dimensions;

the processing module is used for determining, for each data characteristic dimension in the plurality of data characteristic dimensions, a similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be detected in the data characteristic dimension; and is further used for determining the occupancy degree of the invalid users of the user group to be tested according to the similarity index of each data characteristic dimension in the plurality of data characteristic dimensions.

10. A computer device comprising a program or instructions that, when executed, perform the method of any of claims 1 to 8.

11. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 8.