CN111506615A - Method and device for determining occupation degree of invalid user - Google Patents


Info

Publication number
CN111506615A
Authority
CN
China
Prior art keywords
data
user
user group
group
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010321445.5A
Other languages
Chinese (zh)
Inventor
高文辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010321445.5A priority Critical patent/CN111506615A/en
Publication of CN111506615A publication Critical patent/CN111506615A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0277Online advertisement

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Probability & Statistics with Applications (AREA)
  • Strategic Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for determining the occupancy degree of invalid users. The method comprises: acquiring user data of a reference user group and user data of a user group to be tested; for each of a plurality of data feature dimensions, determining a similarity index between the data distributions of the user data of the reference user group and the user data of the user group to be tested in that data feature dimension; and determining the occupancy degree of invalid users in the user group to be tested according to the similarity index of each of the plurality of data feature dimensions. When the method is applied to financial technology (Fintech), even if a user is not on the invalid user list, the occupancy degree of invalid users in the user group to be tested can be determined by comparing its user data with the user data of an invalid user group or a valid user group, so that the occupancy degree of invalid users can be determined more accurately.

Description

Method and device for determining occupation degree of invalid user
Technical Field
The invention relates to the field of information security in the field of financial technology (Fintech), in particular to a method and a device for determining the occupation degree of an invalid user.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). Because the financial industry demands security and real-time performance, it also places higher requirements on these technologies. For a financial institution, promoting its financial instruments, such as financial mobile phone software (Application, App), is crucial, and the number of valid registered users is an important index for evaluating how well a financial instrument has been promoted.
However, in the current process of promoting financial instruments, there is a risk that the registered users are flooded with a large number of invalid users, such as bonus-hunting ("wool-pulling") users, who consume a large amount of promotion resources. Discovering this risk in time is of great significance to the promotion of financial instruments. In the current method, the registered users of the financial instrument are matched against an invalid user list to determine the occupancy degree of invalid users, and if the occupancy degree exceeds a preset proportion, it is determined that a large number of invalid users exist. However, the coverage of the invalid user list is limited, so it is difficult to determine the occupancy degree of invalid users accurately in this way.
Disclosure of Invention
The invention provides a method and a device for determining the occupation degree of an invalid user, which solve the problem that the occupation degree of the invalid user is difficult to accurately determine in the prior art.
In a first aspect, the present invention provides a method for determining occupancy of an invalid user, including: acquiring user data of a reference user group and user data of a user group to be detected; the reference user group is an invalid user group or an effective user group; the user data of the reference user group and the user data of the user group to be detected both comprise data of a plurality of data characteristic dimensions; determining similarity indexes of data distribution of the user data of the reference user group and the user data of the user group to be detected in the data characteristic dimension aiming at each data characteristic dimension in a plurality of data characteristic dimensions; and determining the occupation degree of invalid users of the user group to be tested according to the similarity index of each data characteristic dimension in the plurality of data characteristic dimensions.
According to the method, after the user data of the reference user group and the user data of the user group to be detected are obtained, the similarity of data distribution between the user data of the reference user group and the user data of the user group to be detected under the data characteristic dimension can be determined by comparing the similarity index of each data characteristic dimension in a plurality of data characteristic dimensions, so that even if the user is not in an invalid user list, the occupation degree of the invalid user of the user group to be detected can be determined by comparing the similarity of the user data of the invalid user group or the user data of the valid user group, and the occupation degree of the invalid user can be determined more accurately.
Optionally, the determining, according to the similarity index of each data feature dimension in the plurality of data feature dimensions, the occupancy degree of the invalid user of the user group to be tested includes: determining an occupation index of the occupation degree of invalid users of the user group to be tested under the data characteristic dimension according to the similarity index of each data characteristic dimension in the data characteristic dimensions; the occupation index is used for representing the occupation degree of invalid users of the user group to be detected under the data characteristic dimension; determining the occupation index of the occupation degree of the invalid users of the user group to be detected under the plurality of data characteristic dimensions according to the occupation index of the occupation degree of the invalid users of the user group to be detected under each data characteristic dimension in the plurality of data characteristic dimensions and the weight value of the data characteristic dimension; and determining the occupation degree of the invalid users of the user group to be tested according to the occupation indexes of the occupation degrees of the invalid users of the user group to be tested under the multiple data characteristic dimensions.
In the above method, an occupancy index of the user group to be tested is first determined for each data feature dimension according to the similarity index of that dimension, which characterizes the occupancy degree of invalid users in that single dimension. The per-dimension occupancy indices are then combined with the weight values of the dimensions into an occupancy index over the plurality of data feature dimensions, which characterizes the occupancy degree of invalid users when all dimensions are considered together. The occupancy degree of invalid users in the user group to be tested is thus determined accurately by jointly considering the plurality of data feature dimensions and their corresponding weight values.
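The combination of per-dimension occupancy indices with weight values can be sketched as follows. This is a minimal illustration, not the formula of the application: the text only says the combined index is determined "according to" the per-dimension indices and the weights, so the weighted average used here, the function name combine_occupancy_indices, and the sample numbers are all assumptions.

```python
def combine_occupancy_indices(indices, weights):
    """Combine per-dimension occupancy indices into one overall index.

    indices: dict mapping dimension name -> occupancy index in that dimension
    weights: dict mapping dimension name -> weight value of that dimension
    A weighted average is an assumed combination rule, not the one in the source.
    """
    if set(indices) != set(weights):
        raise ValueError("indices and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(indices[d] * weights[d] for d in indices) / total_weight


# Hypothetical values: the age dimension is weighted three times as heavily.
per_dimension = {"age": 0.8, "phone_attr": 0.4}
dimension_weights = {"age": 3.0, "phone_attr": 1.0}
print(combine_occupancy_indices(per_dimension, dimension_weights))
```

Dividing by the total weight keeps the combined index in the same range as the per-dimension indices, which is convenient if each index is a probability.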
Optionally, the weight value of each of the plurality of data feature dimensions is inversely related to the degree of dispersion of the user data of the reference user group in the data feature dimension.
In the method, for the user data of the reference user group and the user data of the user group to be detected, the smaller the dispersion degree in one data characteristic dimension is, the lower the difference between the data is, so that the similarity index in the data characteristic dimension is more sensitive to the difference of the data, and the occupation degree of invalid users can be more obviously reflected by setting a negative correlation relationship to the weight value.
Optionally, the weight value of each data feature dimension in the plurality of data feature dimensions is specifically calculated according to the following method: determining, for each of the plurality of data feature dimensions, a standard deviation of the user data of the reference user group in the data feature dimension, thereby characterizing a degree of dispersion of the user data of the reference user group in the data feature dimension; determining the weight value of the data characteristic dimension according to the reciprocal of the standard deviation of the user data of the reference user group under the plurality of data characteristic dimensions; the weighted value of the data characteristic dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group in the data characteristic dimension.
In the above method, the dispersion degree of the user data of the reference user group in the data feature dimension is represented by the standard deviation of the user data of the reference user group in the data feature dimension, and the weight value of the data feature dimension is determined according to the inverse of the standard deviation of the user data of the reference user group in the data feature dimensions, so that a method for determining the weight value of each data feature dimension in the data feature dimensions according to the standard deviation and the inverse of the standard deviation is provided.
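A weight that grows with the reciprocal of the standard deviation can be sketched as below. The source only states that the weight is positively correlated with the reciprocal of the standard deviation; normalizing the reciprocals so that they sum to 1, using the population standard deviation, and assuming that each dimension has already been encoded as numeric values are all assumptions of this sketch, as is the function name inverse_std_weights.

```python
import statistics


def inverse_std_weights(reference_data):
    """Derive a weight per dimension from the reference group's dispersion.

    reference_data: dict mapping dimension name -> list of numeric values
    of the reference user group in that dimension (numeric encoding assumed).
    """
    reciprocals = {}
    for dimension, values in reference_data.items():
        sigma = statistics.pstdev(values)  # population standard deviation
        if sigma == 0:
            raise ValueError(f"zero dispersion in dimension {dimension!r}")
        reciprocals[dimension] = 1.0 / sigma
    total = sum(reciprocals.values())
    # Normalization to a sum of 1 is an assumption, not stated in the source.
    return {dimension: r / total for dimension, r in reciprocals.items()}


# Hypothetical data: dimension "a" is less dispersed, so it gets the larger weight.
print(inverse_std_weights({"a": [1, 3], "b": [0, 4]}))
```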
Optionally, the occupancy level is high occupancy or low occupancy; and the occupation index of the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is the probability that the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is high or low.
In the mode, the occupation degree is represented by the probability of high occupation or the probability of low occupation, so that the occupation degree of invalid users of the user group to be tested is more intuitively represented.
Optionally, determining, for each of the plurality of data feature dimensions, the similarity index between the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension includes: determining, for each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension; determining, according to the user data of the reference user group and the user data of the user group to be tested, the number of users of the reference user group and the number of users of the user group to be tested in the subgroup of each of the plurality of categories; and determining the similarity index between the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension according to the numbers of users of the two groups in the subgroup of each category.
In the above manner, each data characteristic dimension is subdivided into subgroups of multiple categories, and the number of users in each subgroup of the reference user group and the user group to be tested is determined, so that the similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested in the data characteristic dimension is determined by considering the subgroups of the multiple categories.
Optionally, the similarity index of the data distribution in the data feature dimension is the Pearson correlation coefficient or the cosine similarity between the user data of the reference user group and the user data of the user group to be tested in that data feature dimension.
In the above manner, the Pearson correlation coefficient or the cosine similarity characterizes the similarity between the user data of the reference user group and the user data of the user group to be tested.
Optionally, the plurality of data feature dimensions include at least one of: the user's age group; the place of attribution of the user's ID card number; the place of attribution of the user's mobile phone number; the place of attribution of the user's bank card number; the place of attribution of the user's Internet Protocol (IP) address.
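Both similarity measures named above can be computed directly from two equally long vectors of per-subgroup user counts. The following is a minimal sketch using the standard definitions; the function names pearson and cosine and the sample vectors are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


def cosine(x, y):
    """Cosine similarity of two equally long numeric sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical per-subgroup counts for a reference group and a group to be tested.
reference_counts = [120, 300, 80, 10]
tested_counts = [100, 310, 95, 5]
print(pearson(reference_counts, tested_counts))
print(cosine(reference_counts, tested_counts))
```

The Pearson coefficient ignores a constant offset between the two vectors, while cosine similarity does not, so the two measures can rank the same pair of groups differently.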
Since the frequent expression of the invalid users is in the concentration of age groups or the concentration of various number attributions, the occupation degree of the invalid users can be more accurately determined by focusing on the plurality of data characteristic dimensions.
In a second aspect, the present invention provides a device for determining the occupancy degree of invalid users, comprising: an acquisition module, configured to acquire user data of a reference user group and user data of a user group to be tested, where the reference user group is an invalid user group or a valid user group, and the user data of the reference user group and the user data of the user group to be tested both comprise data of a plurality of data feature dimensions; and a processing module, configured to determine, for each of the plurality of data feature dimensions, a similarity index between the data distributions of the user data of the reference user group and the user data of the user group to be tested in that data feature dimension, and to determine the occupancy degree of invalid users in the user group to be tested according to the similarity index of each of the plurality of data feature dimensions.
Optionally, the processing module is specifically configured to: determining an occupation index of the occupation degree of invalid users of the user group to be tested under the data characteristic dimension according to the similarity index of each data characteristic dimension in the data characteristic dimensions; the occupation index is used for representing the occupation degree of invalid users of the user group to be detected under the data characteristic dimension; determining the occupation index of the occupation degree of the invalid users of the user group to be detected under the plurality of data characteristic dimensions according to the occupation index of the occupation degree of the invalid users of the user group to be detected under each data characteristic dimension in the plurality of data characteristic dimensions and the weight value of the data characteristic dimension; and determining the occupation degree of the invalid users of the user group to be tested according to the occupation indexes of the occupation degrees of the invalid users of the user group to be tested under the multiple data characteristic dimensions.
Optionally, the weight value of each of the plurality of data feature dimensions is inversely related to the degree of dispersion of the user data of the reference user group in the data feature dimension.
Optionally, the weight value of each data feature dimension in the plurality of data feature dimensions is specifically calculated according to the following method: determining, for each of the plurality of data feature dimensions, a standard deviation of the user data of the reference user group in the data feature dimension, thereby characterizing a degree of dispersion of the user data of the reference user group in the data feature dimension; determining the weight value of the data characteristic dimension according to the reciprocal of the standard deviation of the user data of the reference user group under the plurality of data characteristic dimensions; the weighted value of the data characteristic dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group in the data characteristic dimension.
Optionally, the occupancy level is high occupancy or low occupancy; and the occupation index of the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is the probability that the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is high or low.
Optionally, the processing module is specifically configured to: determine, for each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension; determine, according to the user data of the reference user group and the user data of the user group to be tested, the number of users of the reference user group and the number of users of the user group to be tested in the subgroup of each of the plurality of categories; and determine the similarity index between the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension according to the numbers of users of the two groups in the subgroup of each category.
Optionally, the similarity index of the data distribution in the data feature dimension is the Pearson correlation coefficient or the cosine similarity between the user data of the reference user group and the user data of the user group to be tested in that data feature dimension.
Optionally, the plurality of data feature dimensions include at least one of: the user's age group; the place of attribution of the user's ID card number; the place of attribution of the user's mobile phone number; the place of attribution of the user's bank card number; the place of attribution of the user's Internet Protocol (IP) address.
The advantageous effects of the second aspect and the various optional apparatuses of the second aspect may refer to the advantageous effects of the first aspect and the various optional methods of the first aspect, and are not described herein again.
In a third aspect, the present invention provides a computer device comprising a program or instructions for performing the method of the first aspect and the alternatives of the first aspect when the program or instructions are executed.
In a fourth aspect, the present invention provides a storage medium comprising a program or instructions which, when executed, is adapted to perform the method of the first aspect and the alternatives of the first aspect.
Drawings
Fig. 1 is a schematic flowchart illustrating steps of a method for determining occupancy of an invalid user according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of an occupancy level determination device for an invalid user according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions, the technical solutions will be described in detail below with reference to the drawings and the specific embodiments of the specification, and it should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, but not limitations of the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
The terms appearing in the embodiments of the present application are first listed below.
Valid user: a user who registers the App for the purpose of normal use.
Invalid user: a user who registers the App for a purpose other than normal use. For example, a bonus-hunting ("wool-pulling") user registers the App mainly to obtain the reward of a promotional activity and rarely logs in to the App afterwards.
Real-name registration: the user information submitted when registering the App.
Pearson correlation test: tests whether two sets of data are significantly correlated by calculating the correlation coefficient of two sets of data X and Y with sample size n. The correlation coefficient takes values in [-1, 1]: the closer the value is to 1, the stronger the positive linear correlation between the two sets of data; the closer the value is to -1, the stronger the negative linear correlation; and the closer the value is to 0, the weaker the linear correlation. The correlation coefficient is defined as follows, where $\bar{X}$ and $\bar{Y}$ are the sample means of X and Y:
$$\rho_{X,Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
When a financial institution (a banking institution, an insurance institution or a securities institution) conducts business (such as the loan business and deposit business of a bank), promoting the corresponding financial instrument App to valid users is of great significance to the institution. However, a large number of invalid users may register during the promotion of a financial instrument, and these invalid users waste promotion resources, so how to determine the occupancy degree of invalid users is a problem of great concern. In the conventional method, the occupancy degree of invalid users is determined by matching the registered users of the financial instrument against an invalid user list, but the coverage of the invalid user list is limited, and it is difficult to determine the occupancy degree of invalid users accurately. This does not meet the requirements of financial institutions such as banks and cannot ensure the efficient operation of their services.
For this reason, as shown in fig. 1, the embodiment of the present application provides a method for determining the occupancy level of an invalid user.
Step 101: acquire the user data of the reference user group and the user data of the user group to be tested.
Step 102: for each of a plurality of data feature dimensions, determine a similarity index between the data distributions of the user data of the reference user group and the user data of the user group to be tested in that data feature dimension.
Step 103: determine the occupancy degree of invalid users in the user group to be tested according to the similarity index of each of the plurality of data feature dimensions.
In steps 101 to 103, a user may refer to a registered user, and the reference user group is an invalid user group or a valid user group. An invalid user group may be a user group containing a large share of invalid users (for example, a share exceeding an invalid user ratio threshold), and a valid user group may be a user group containing a large share of valid users (for example, a share exceeding a valid user ratio threshold).
Specifically, the invalid user group or the valid user group may be determined according to the occurrence of invalid users in the user group within a preset historical time range. The reference user group (labeled, for example, user_group0) may be an invalid user group or a valid user group, and the user group to be checked for a suspected large number of invalid users serves as the user group to be tested (labeled, for example, user_group1). For example, all users registered on day N are taken as user_group1, and all users registered on day N-7 are taken as user_group0.
The user data of the reference user group and the user data of the user group to be tested both comprise data of a plurality of data feature dimensions. A data feature dimension refers to one aspect of the attributes used to describe a set of user data; for example, a data feature dimension may be the user's age or the user's online duration. In each of the plurality of data feature dimensions, the similarity index characterizes how similar the user data of the reference user group and the user data of the user group to be tested are. The more similar the user data of the user group to be tested is to that of the reference user group (for example, an invalid user group), the higher the occupancy degree of invalid users in the user group to be tested, or equivalently the lower the occupancy degree of valid users. The case in which the reference user group is a valid user group is analogous, with the relationship reversed. In addition, the occupancy degree may take specific defined values, for example high occupancy or low occupancy; more values (for example, multiple occupancy levels) may be defined to represent the occupancy degree of valid or invalid users, and a mapping between the value of the similarity index and the value of the occupancy degree may be established.
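A mapping from a similarity index to an occupancy degree value could look like the sketch below. The two-level scale follows the high/low example in the text, but the single threshold of 0.8, the function name occupancy_level and its parameters are hypothetical choices made only for illustration.

```python
def occupancy_level(similarity, reference_is_invalid, threshold=0.8):
    """Map a similarity index to "high" or "low" occupancy of invalid users.

    similarity: similarity index between the reference group and the tested group
    reference_is_invalid: True if the reference group is an invalid user group
    threshold: hypothetical cut-off above which the groups count as similar
    """
    is_similar = similarity >= threshold
    if reference_is_invalid:
        # Resembling an invalid group suggests many invalid users.
        return "high" if is_similar else "low"
    # Resembling a valid group suggests few invalid users.
    return "low" if is_similar else "high"


print(occupancy_level(0.93, reference_is_invalid=True))
print(occupancy_level(0.93, reference_is_invalid=False))
```

More occupancy levels could be supported by replacing the single threshold with an ordered list of cut-offs.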
It should be noted that, in general, an invalid user may exhibit an aggregation characteristic in some data feature dimensions during registration, for example, the age of the registered user is relatively large, the identification number used for registration is concentrated in some places, and the IP address reported during registration is concentrated in some places.
Thus, specific data feature dimensions may be selected as the plurality of data feature dimensions, which in an alternative embodiment may include at least one of: the user's age group; the place of attribution of the user's ID card number; the place of attribution of the user's mobile phone number; the place of attribution of the user's bank card number; the place of attribution of the user's Internet Protocol (IP) address.
An alternative implementation of step 102 is as follows:
for each of the plurality of data feature dimensions, determining subgroups of a plurality of categories under the data feature dimension; determining, according to the user data of the reference user group and the user data of the user group to be tested, the number of users of the reference user group and the number of users of the user group to be tested in the subgroup of each of the plurality of categories; and determining the similarity index between the data distributions of the user data of the reference user group and the user data of the user group to be tested in the data feature dimension according to the numbers of users of the two groups in the subgroup of each category.
Specifically, as an example of the above optional implementation mode of selecting a plurality of data feature dimensions, the plurality of data feature dimensions may be obtained by processing in the following manner:
a) Counting the number of users in the subgroups of the categories of the user's age group: the users' ages at the time of real-name registration are grouped into age bands, such as 20 and below, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91 and above; the number of users in the subgroup of each age band is counted and recorded as age_cnt.
b) Counting the number of users in the subgroups of the categories of the home location of the user's ID number: the first 6 digits of the ID number are extracted to obtain the city to which the ID number belongs; the number of users in the subgroup of each home location is counted and recorded as idno_attr_cnt.
c) Counting the number of users in the subgroups of the categories of the home location of the user's mobile phone number: the first k (e.g. 7) digits of the mobile phone number are extracted to obtain its home city; if the number is a virtual-operator number, e.g. one beginning with 147, 170 or 171, its home location is assigned to a dedicated virtual-operator subgroup (also treated as a city-category subgroup); the number of users in the subgroup of each home location is counted and recorded as phone_attr_cnt.
d) Counting the number of users in the subgroups of the categories of the home location of the user's bank card number: the bank card number is looked up to obtain its home location; the number of users in the subgroup of each home location is counted and recorded as card_attr_cnt.
e) Counting the number of users in the subgroups of the categories of the home location of the user's Internet Protocol (IP) address: the IP address is looked up to obtain its home location; the number of users in the subgroup of each home location is counted and recorded as ip_attr_cnt.
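The grouping and counting in items a) to e) can be illustrated with a minimal Python sketch. The function names, the band labels, the `prefix_to_city` lookup table and the `"unknown"` fallback are assumptions introduced for illustration; only the age bands, the k = 7 prefix length and the example virtual-operator prefixes come from the text.

```python
from collections import Counter

# Age bands from item a); the label strings are illustrative.
AGE_BANDS = [(0, 20, "<=20"), (21, 30, "21-30"), (31, 40, "31-40"),
             (41, 50, "41-50"), (51, 60, "51-60"), (61, 70, "61-70"),
             (71, 80, "71-80"), (81, 90, "81-90")]

# Prefixes the text gives as examples of virtual-operator numbers.
VIRTUAL_PREFIXES = ("147", "170", "171")


def age_band(age: int) -> str:
    """Map an age at real-name registration to its age-band subgroup."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return ">=91"


def phone_home(phone: str, prefix_to_city: dict, k: int = 7) -> str:
    """Map a phone number to its home-location subgroup.

    prefix_to_city is a hypothetical lookup from the first k digits to a
    city; virtual-operator numbers all go to one dedicated subgroup.
    """
    if phone.startswith(VIRTUAL_PREFIXES):
        return "virtual_operator"
    return prefix_to_city.get(phone[:k], "unknown")  # fallback is an assumption


def count_subgroups(values, to_subgroup) -> Counter:
    """Count users per subgroup, e.g. age_cnt or phone_attr_cnt."""
    return Counter(to_subgroup(v) for v in values)


# Made-up sample data for illustration only.
age_cnt = count_subgroups([19, 25, 27, 45, 93], age_band)
phone_attr_cnt = count_subgroups(
    ["13800001234", "17012345678", "13800009999"],
    lambda p: phone_home(p, {"1380000": "CityA"}),
)
```

The same `count_subgroups` helper could serve the ID number, bank card and IP address dimensions once a corresponding lookup function is supplied.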
In addition, for the same data feature dimension of the user data of the user group to be tested and the user data of the reference user group, if one user group has a certain subgroup but the other user group has no corresponding subgroup, the count of the latter user group for that subgroup is filled with 0, as with the two zeros in Table 1:
Figure BDA0002461579530000101
TABLE 1
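The zero filling described above can be sketched as follows. This is a minimal illustration, assuming the per-subgroup counts are held in dictionaries keyed by subgroup label; the function name and the sample counts are hypothetical and are not the values of Table 1.

```python
def align_counts(ref_cnt: dict, test_cnt: dict):
    """Return sorted subgroup labels and two equally long count vectors.

    A subgroup that exists in only one user group is given a count of 0
    in the other, so both vectors cover the same subgroups in the same order.
    """
    labels = sorted(set(ref_cnt) | set(test_cnt))
    ref_vec = [ref_cnt.get(label, 0) for label in labels]
    test_vec = [test_cnt.get(label, 0) for label in labels]
    return labels, ref_vec, test_vec


# Hypothetical counts per home city for the two user groups.
ref = {"CityA": 120, "CityB": 80, "CityC": 15}
test = {"CityA": 40, "CityB": 30, "CityD": 200}
labels, ref_vec, test_vec = align_counts(ref, test)
```

Aligning the two vectors this way is what makes a per-subgroup comparison such as a correlation coefficient well defined.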
It should be noted that, in an optional implementation of steps 101 to 103, the similarity index of the data distribution under a data feature dimension may be the Pearson correlation coefficient or the cosine similarity of the user data of the reference user group and the user data of the user group to be tested under that data feature dimension. Taking the Pearson correlation coefficient as an example, and taking the reference user group to be an effective user group, the similarity index $\rho_{age}$ of step 102 under the data feature dimension of the user's age group is:

$$\rho_{age} = \frac{\sum_{i=1}^{n}\left(\mathrm{age\_cnt0}_i - \overline{\mathrm{age\_cnt0}}\right)\left(\mathrm{age\_cnt1}_i - \overline{\mathrm{age\_cnt1}}\right)}{\sqrt{\sum_{i=1}^{n}\left(\mathrm{age\_cnt0}_i - \overline{\mathrm{age\_cnt0}}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\mathrm{age\_cnt1}_i - \overline{\mathrm{age\_cnt1}}\right)^2}}$$

where $\mathrm{age\_cnt0}_i$ is the number of users in the subgroup of the i-th age group of the reference user group, $\overline{\mathrm{age\_cnt0}}$ is the average number of users over the subgroups of the n age groups of the reference user group, $\mathrm{age\_cnt1}_i$ is the number of users in the subgroup of the i-th age group of the user group to be tested, and $\overline{\mathrm{age\_cnt1}}$ is the average number of users over the subgroups of the n age groups of the user group to be tested.
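The two similarity indexes named above can be computed with the following minimal Python sketch. It uses the standard definitions of the Pearson correlation coefficient and cosine similarity; the function names and the sample counts are illustrative, and the handling of constant or all-zero vectors by raising an error is an assumption, since the text does not address that case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long count vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of the same length n >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(x, y):
    """Cosine similarity of two equally long count vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_sq = sum(a * a for a in x) * sum(b * b for b in y)
    if norm_sq == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / math.sqrt(norm_sq)


# Hypothetical counts per age band: the group under test has the same
# shape as the reference group but twice as many users.
age_cnt0 = [10, 40, 30, 15, 5]
age_cnt1 = [20, 80, 60, 30, 10]
rho_age = pearson(age_cnt0, age_cnt1)
```

Because both measures are invariant to scaling one vector by a positive constant, the two user groups do not need to be the same size for the comparison to be meaningful.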
An alternative implementation of step 103 is as follows:
Step (103-1): for each of the plurality of data feature dimensions, determining, according to the similarity index of the data feature dimension, an occupation index of the occupation degree of invalid users of the user group to be tested under that data feature dimension.
Step (103-2): determining the occupation index of the occupation degree of invalid users of the user group to be tested under the plurality of data feature dimensions, according to the occupation index under each of the plurality of data feature dimensions and the weight value of that data feature dimension.
Step (103-3): determining the occupation degree of invalid users of the user group to be tested according to the occupation index of the occupation degree of invalid users of the user group to be tested under the plurality of data feature dimensions.
In step (103-1), the occupation index characterizes the occupation degree of invalid users of the user group to be tested under the data feature dimension. In an optional embodiment, the occupation degree is either high occupation or low occupation, and the occupation index under the data feature dimension is the probability that the occupation degree of invalid users of the user group to be tested under that data feature dimension is high, or the probability that it is low.
In the optional embodiment in which the occupation index is a probability, and continuing the example in which the similarity index is $\rho_{age}$, step (103-1) obtains the probability $P_{age}$ that the occupation degree of invalid users of the user group to be tested is high under the data feature dimension of the user's age group. $P_{age}$ takes values in [0, 1]; one example is:
Figure BDA0002461579530000112
According to the definition of the Pearson correlation coefficient, under a given data feature dimension (such as the user's age group), the stronger the positive linear correlation between the user data of the reference user group and the user data of the user group to be tested, the more similar the distributions of the user data of the reference user group (here the effective user group) and of the user group to be tested. Accordingly, $P_{age}$ is smaller, that is, under the data feature dimension of the user's age group it is less likely that the occupation degree of invalid users of the user group to be tested is high. The probabilities for the other data feature dimensions are analogous.
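As an illustration of how a correlation coefficient can be turned into such a probability, the following Python sketch uses the linear map (1 − ρ) / 2. This particular map is an assumption chosen only because it has the two properties stated in the text (values in [0, 1], and a smaller value for a stronger positive correlation with an effective reference group); it is not asserted to be the formula used in the original publication.

```python
def high_occupation_probability(rho: float) -> float:
    """Map a Pearson coefficient to a probability in [0, 1].

    ASSUMPTION: the linear map (1 - rho) / 2 is illustrative only. It
    presumes the reference user group is an effective user group, so that
    rho = 1 gives 0 and rho = -1 gives 1.
    """
    # Clamp to [-1, 1] to absorb floating-point rounding in rho.
    rho = max(-1.0, min(1.0, rho))
    return (1.0 - rho) / 2.0


p_age = high_occupation_probability(0.8)
```

Any other monotonically decreasing map from [-1, 1] onto [0, 1] would satisfy the same stated properties.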
Further, based on the above example of step (103-1), the probabilities that the occupation degree of invalid users of the user group to be tested is high can be obtained in turn for the following data feature dimensions: the user's age group; the home location of the user's ID number; the home location of the user's mobile phone number; the home location of the user's bank card number; and the home location of the user's Internet Protocol (IP) address. These probabilities are denoted $P_{age}$, $P_{idno}$, $P_{phone}$, $P_{card}$ and $P_{ip}$. With the weight values of the data feature dimensions denoted $w_{age}$, $w_{idno}$, $w_{phone}$, $w_{card}$ and $w_{ip}$ respectively, step (103-2) is specifically executed as:

$$S = w_{age} P_{age} + w_{idno} P_{idno} + w_{phone} P_{phone} + w_{card} P_{card} + w_{ip} P_{ip}$$

where S is the occupation index of the occupation degree of invalid users of the user group to be tested under the plurality of data feature dimensions.
In an optional embodiment, the weight value of each data feature dimension may be set as follows: the weight value of each of the plurality of data feature dimensions is negatively correlated with the degree of dispersion of the user data of the reference user group under that data feature dimension.
The smaller the degree of dispersion of the user data under a data feature dimension, the smaller the differences among the data, and the more sensitive the similarity index of that data feature dimension is to differences in the data. Setting the weight value to be negatively correlated with the degree of dispersion therefore allows the occupation degree of invalid users to be reflected more clearly.
It should be noted that, the weight value of each data feature dimension in the multiple data feature dimensions is specifically calculated as follows:
for each of the plurality of data feature dimensions, determining the standard deviation of the user data of the reference user group under the data feature dimension, which characterizes the degree of dispersion of the user data of the reference user group under that data feature dimension; and determining the weight value of the data feature dimension according to the reciprocals of the standard deviations of the user data of the reference user group under the plurality of data feature dimensions, where the weight value of the data feature dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group under that data feature dimension.
Taking the data feature dimension of the user's age group as an example, the corresponding weight value $w_{age}$ may be determined as follows:

$$w_{age} = \frac{1/\sigma_{age}}{1/\sigma_{age} + 1/\sigma_{idno} + 1/\sigma_{phone} + 1/\sigma_{card} + 1/\sigma_{ip}}$$

where $\sigma_{age}$, $\sigma_{idno}$, $\sigma_{phone}$, $\sigma_{card}$ and $\sigma_{ip}$ are, in order, the standard deviations of the user data of the reference user group under the data feature dimensions of the user's age group, the home location of the user's ID number, the home location of the user's mobile phone number, the home location of the user's bank card number, and the home location of the user's Internet Protocol (IP) address.
Thus $w_{age} + w_{idno} + w_{phone} + w_{card} + w_{ip} = 1$, and S takes values in [0, 1]. A preset alarm threshold can be set for S; when S is greater than the preset alarm threshold, an alarm is raised indicating that the user group to be tested contains a high proportion of invalid users. The closer S is to 1, the more serious the invalid user registration.
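The weighting, aggregation and alarm steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the standard deviation is taken over the per-subgroup user counts of the reference user group, the population form `statistics.pstdev` is used (the text does not say whether the population or sample form is meant), and the function names, sample values and the threshold of 0.4 are hypothetical.

```python
import statistics


def dimension_weights(ref_counts: dict) -> dict:
    """Weights proportional to 1/sigma per dimension, normalised to sum to 1."""
    inv_sigma = {}
    for name, counts in ref_counts.items():
        sigma = statistics.pstdev(counts)  # population form is an assumption
        if sigma == 0:
            raise ValueError(f"zero standard deviation in dimension {name!r}")
        inv_sigma[name] = 1.0 / sigma
    total = sum(inv_sigma.values())
    return {name: value / total for name, value in inv_sigma.items()}


def composite_index(probabilities: dict, weights: dict) -> float:
    """S = sum over dimensions of weight * probability."""
    return sum(weights[name] * probabilities[name] for name in weights)


# Hypothetical reference-group counts for two dimensions.
ref_counts = {"age": [1, 3], "phone": [2, 6]}
weights = dimension_weights(ref_counts)

# Hypothetical per-dimension probabilities of high occupation.
probabilities = {"age": 0.3, "phone": 0.9}
S = composite_index(probabilities, weights)

ALARM_THRESHOLD = 0.4  # hypothetical preset alarm threshold
alarm = S > ALARM_THRESHOLD
```

In this example the age dimension has the smaller standard deviation and therefore the larger weight, which matches the negative correlation between weight and dispersion described in the text.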
As shown in Fig. 2, the present invention provides a device for determining the occupation degree of invalid users, comprising: an obtaining module 201, configured to obtain user data of a reference user group and user data of a user group to be tested, where the reference user group is an invalid user group or an effective user group, and the user data of the reference user group and the user data of the user group to be tested both comprise data of a plurality of data feature dimensions; and a processing module 202, configured to determine, for each of the plurality of data feature dimensions, a similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension, and to determine the occupation degree of invalid users of the user group to be tested according to the similarity index of each of the plurality of data feature dimensions.
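The division of work between the two modules can be illustrated with a minimal Python sketch. The class and field names, the use of injected callables, and the sample similarity and decision functions are all assumptions made for illustration; the text only specifies what each module is configured to do.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Counts = Dict[str, List[int]]  # dimension name -> per-subgroup user counts


@dataclass
class ObtainingModule:
    """Plays the role of module 201: supplies both groups' user data."""
    load_reference: Callable[[], Counts]
    load_test: Callable[[], Counts]

    def obtain(self) -> Tuple[Counts, Counts]:
        return self.load_reference(), self.load_test()


@dataclass
class ProcessingModule:
    """Plays the role of module 202: per-dimension similarity, then a decision."""
    similarity: Callable[[List[int], List[int]], float]
    decide: Callable[[Dict[str, float]], float]

    def run(self, reference: Counts, test: Counts) -> float:
        per_dimension = {
            name: self.similarity(reference[name], test[name])
            for name in reference
        }
        return self.decide(per_dimension)


def cosine(x, y):
    """Cosine similarity, used here only as a stand-in similarity index."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


obtaining = ObtainingModule(lambda: {"age": [1, 2, 3]}, lambda: {"age": [2, 4, 6]})
# Hypothetical decision rule: one minus the mean similarity.
processing = ProcessingModule(cosine, lambda sims: 1.0 - sum(sims.values()) / len(sims))
result = processing.run(*obtaining.obtain())
```

Passing the similarity and decision functions in as parameters keeps the sketch neutral about which optional implementation (Pearson or cosine, weighted or unweighted) is used.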
Optionally, the processing module 202 is specifically configured to: determine, for each of the plurality of data feature dimensions and according to the similarity index of the data feature dimension, an occupation index of the occupation degree of invalid users of the user group to be tested under that data feature dimension, where the occupation index characterizes the occupation degree of invalid users of the user group to be tested under the data feature dimension; determine the occupation index of the occupation degree of invalid users of the user group to be tested under the plurality of data feature dimensions, according to the occupation index under each of the plurality of data feature dimensions and the weight value of that data feature dimension; and determine the occupation degree of invalid users of the user group to be tested according to the occupation index under the plurality of data feature dimensions.
Optionally, the weight value of each of the plurality of data feature dimensions is inversely related to the degree of dispersion of the user data of the reference user group in the data feature dimension.
Optionally, the weight value of each data feature dimension in the plurality of data feature dimensions is specifically calculated according to the following method: determining, for each of the plurality of data feature dimensions, a standard deviation of the user data of the reference user group in the data feature dimension, thereby characterizing a degree of dispersion of the user data of the reference user group in the data feature dimension; determining the weight value of the data characteristic dimension according to the reciprocal of the standard deviation of the user data of the reference user group under the plurality of data characteristic dimensions; the weighted value of the data characteristic dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group in the data characteristic dimension.
Optionally, the occupancy level is high occupancy or low occupancy; and the occupation index of the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is the probability that the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is high or low.
Optionally, the processing module 202 is specifically configured to: determine, for each of a plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension; determine, according to the user data of the reference user group and the user data of the user group to be tested, the number of users in the subgroup of each category for the reference user group and for the user group to be tested; and determine, according to these numbers of users, the similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension.
Optionally, the similarity index of the data distribution under the data feature dimension is the Pearson correlation coefficient or the cosine similarity of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension.
Optionally, the plurality of data feature dimensions include at least one of: the user's age group; the home location of the user's ID number; the home location of the user's mobile phone number; the home location of the user's bank card number; the home location of the user's Internet Protocol (IP) address.
Embodiments of the present application provide a computer device, which includes a program or instructions, and when the program or instructions are executed, the program or instructions are used to execute an occupancy level determination method for invalid users and any optional method provided by embodiments of the present application.
Embodiments of the present application provide a storage medium, which includes a program or an instruction, and when the program or the instruction is executed, the program or the instruction is used to execute a method for determining occupancy of an invalid user and any optional method provided by embodiments of the present application.
Finally, it should be noted that: as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. A method for determining occupancy of an invalid user, comprising:
acquiring user data of a reference user group and user data of a user group to be detected; the reference user group is an invalid user group or an effective user group; the user data of the reference user group and the user data of the user group to be detected both comprise data of a plurality of data characteristic dimensions;
for each of a plurality of data feature dimensions, determining a similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension;
and determining the occupation degree of invalid users of the user group to be tested according to the similarity index of each data characteristic dimension in the plurality of data characteristic dimensions.
2. The method of claim 1, wherein determining the degree of occupancy of invalid users of the group of users to be tested based on the similarity measure for each of the plurality of data feature dimensions comprises:
determining an occupation index of the occupation degree of invalid users of the user group to be tested under the data characteristic dimension according to the similarity index of each data characteristic dimension in the data characteristic dimensions; the occupation index is used for representing the occupation degree of invalid users of the user group to be detected under the data characteristic dimension;
determining the occupation index of the occupation degree of the invalid users of the user group to be detected under the plurality of data characteristic dimensions according to the occupation index of the occupation degree of the invalid users of the user group to be detected under each data characteristic dimension in the plurality of data characteristic dimensions and the weight value of the data characteristic dimension;
and determining the occupation degree of the invalid users of the user group to be tested according to the occupation indexes of the occupation degrees of the invalid users of the user group to be tested under the multiple data characteristic dimensions.
3. The method of claim 2, wherein the weight value for each of the plurality of data feature dimensions is inversely related to the degree of dispersion of the user data of the reference user group in the data feature dimension.
4. The method of claim 3, wherein the weight value for each of the plurality of data feature dimensions is calculated in particular as follows:
determining, for each of the plurality of data feature dimensions, a standard deviation of the user data of the reference user group in the data feature dimension, thereby characterizing a degree of dispersion of the user data of the reference user group in the data feature dimension;
determining the weight value of the data characteristic dimension according to the reciprocal of the standard deviation of the user data of the reference user group under the plurality of data characteristic dimensions; the weighted value of the data characteristic dimension is positively correlated with the reciprocal of the standard deviation of the user data of the reference user group in the data characteristic dimension.
5. The method of claim 2, wherein the occupancy level is high occupancy or low occupancy; and the occupation index of the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is the probability that the occupation degree of the invalid users of the user group to be detected in the data characteristic dimension is high or low.
6. The method of claim 1, wherein determining, for each of a plurality of data feature dimensions, a similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension comprises:
determining, for each of the plurality of data feature dimensions, subgroups of a plurality of categories under the data feature dimension;
determining, according to the user data of the reference user group and the user data of the user group to be tested, the number of users in the subgroup of each category for the reference user group and for the user group to be tested;
and determining, according to these numbers of users, the similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension.
7. The method of any one of claims 1 to 6, wherein the similarity index of the data distribution under the data feature dimension is the Pearson correlation coefficient or the cosine similarity of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension.
8. The method of any one of claims 1 to 6, wherein the plurality of data feature dimensions comprise at least one of: the user's age group; the home location of the user's ID number; the home location of the user's mobile phone number; the home location of the user's bank card number; the home location of the user's Internet Protocol (IP) address.
9. An occupancy level determination device for an invalid user, comprising:
an acquisition module, configured to acquire user data of a reference user group and user data of a user group to be tested; the reference user group is an invalid user group or an effective user group; the user data of the reference user group and the user data of the user group to be tested both comprise data of a plurality of data feature dimensions;
a processing module, configured to determine, for each of the plurality of data feature dimensions, a similarity index of the data distribution of the user data of the reference user group and the user data of the user group to be tested under the data feature dimension, and to determine the occupation degree of invalid users of the user group to be tested according to the similarity index of each of the plurality of data feature dimensions.
10. A computer device comprising a program or instructions that, when executed, perform the method of any of claims 1 to 8.
11. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 8.
CN202010321445.5A 2020-04-22 2020-04-22 Method and device for determining occupation degree of invalid user Pending CN111506615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010321445.5A CN111506615A (en) 2020-04-22 2020-04-22 Method and device for determining occupation degree of invalid user

Publications (1)

Publication Number Publication Date
CN111506615A true CN111506615A (en) 2020-08-07

Family

ID=71877834

Country Status (1)

Country Link
CN (1) CN111506615A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764332A (en) * 2018-05-25 2018-11-06 北京证大向上金融信息服务有限公司 A kind of Channel Quality analysis method, computing device and storage medium
CN109284380A (en) * 2018-09-25 2019-01-29 平安科技(深圳)有限公司 Illegal user's recognition methods and device, electronic equipment based on big data analysis
CN109873812A (en) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Method for detecting abnormality, device and computer equipment
CN109949069A (en) * 2019-01-28 2019-06-28 平安科技(深圳)有限公司 Suspicious user screening technique, device, computer equipment and storage medium
CN110189165A (en) * 2019-05-14 2019-08-30 微梦创科网络科技(中国)有限公司 Channel abnormal user and abnormal channel recognition methods and device
CN110351307A (en) * 2019-08-14 2019-10-18 杭州安恒信息技术股份有限公司 Abnormal user detection method and system based on integrated study
CN110517097A (en) * 2019-09-09 2019-11-29 平安普惠企业管理有限公司 Identify method, apparatus, equipment and the storage medium of abnormal user
CN110750238A (en) * 2019-09-20 2020-02-04 阿里巴巴集团控股有限公司 Method and device for determining product requirements and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905476A (en) * 2021-03-12 2021-06-04 网易(杭州)网络有限公司 Test execution method and device, electronic equipment and storage medium
CN112905476B (en) * 2021-03-12 2023-08-11 网易(杭州)网络有限公司 Test execution method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination