CN114841526A - Detection method of high-risk user, computing device and readable storage medium - Google Patents

Detection method of high-risk user, computing device and readable storage medium

Info

Publication number
CN114841526A
Authority
CN
China
Prior art keywords
user
data sample
risk
risk score
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210383838.8A
Other languages
Chinese (zh)
Inventor
邓永国
范光亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cheerbright Technologies Co Ltd
Original Assignee
Beijing Cheerbright Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Cheerbright Technologies Co Ltd
Priority to CN202210383838.8A
Publication of CN114841526A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for detecting a high-risk user, a computing device, and a readable storage medium. The method comprises the following steps: acquiring a first data sample of a first user, wherein the first data sample comprises registration behavior characteristics and login behavior characteristics of the first user; inputting the first data sample into a trained detection model, and outputting the predicted probability that the first user is a high-risk user as a first risk score; acquiring expert rules of a target marketing campaign and behavior characteristics of the first user in the target marketing campaign; evaluating the behavior characteristics of the first user in the target marketing campaign based on the acquired expert rules to obtain a second risk score; and determining, based on the first risk score and the second risk score, whether the first user is a high-risk user. The technical scheme of the invention combines the advantages of the detection model and the expert rules, and provides a stable and reliable method for detecting high-risk users.

Description

Detection method of high-risk user, computing device and readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method for detecting a high-risk user, a computing device, and a readable storage medium.
Background
In internet marketing planning, a frequently cited maxim is "engage with the orthodox, win with the unexpected", that is, a sound conventional strategy complemented by unconventional tactics. Depending on their marketing strategy, merchants hold promotional events in many forms and at many scales, such as sharing among friends, microblog repost lotteries, rewards for bringing in new users, and coupons for taking part in an event. Most of these events have specific goals, such as promotion and renewal. However, when an event is actually run, some malicious users "cheat" by various means. Some of them call themselves the "wool party" and aim at "pulling wool", that is, attempting to profit without complying with the rules of the event. Such cheating can do great damage to the event host: it not only causes economic loss but also harms the interests of other users who take part in the event normally, and it may damage the host's reputation. In the end the effect of the event is greatly discounted and its goal cannot be achieved.
Therefore, identifying black- and gray-market users who engage in such "wool pulling" is an indispensable task for risk control personnel. Existing risk identification and risk scoring techniques fall into two categories: methods based on expert rules and methods based on artificial intelligence. A rule-based method may use a single rule or a combination of rules, and its shortcomings are obvious: its generalization capability is weak, and a slight change in the characteristic behavior of black-market users may be enough to evade recognition. An AI-based method applies techniques such as machine learning and deep learning to perform feature processing and model training on marketing scenario data, and assesses risk according to the model result. Its disadvantage is that, compared with the expert rule scheme, its accuracy is lower and its time to go online is longer.
Therefore, a method for detecting high-risk users is needed to improve the accuracy and generalization capability of detection.
Disclosure of Invention
To this end, the present invention provides a method of detecting a high risk user, a computing device and a readable storage medium, in an attempt to solve or at least alleviate at least one of the problems presented above.
According to an aspect of the present invention, there is provided a method for detecting a high-risk user, executed in a computing device, the method comprising the steps of: acquiring a first data sample of a first user, wherein the first data sample comprises registration behavior characteristics and login behavior characteristics of the first user; inputting the first data sample into a trained detection model, and outputting a probability value for predicting that the first user belongs to a high-risk user as a first risk score; acquiring expert rules of the target marketing campaign and behavior characteristics of the first user in the target marketing campaign; evaluating the behavior characteristics of the first user in the target marketing activity based on the acquired expert rules to obtain a second risk score; based on the first risk score and the second risk score, it is determined whether the first user is a high risk user.
Optionally, in the method for detecting a high-risk user according to the present invention, the step of determining whether the first user is a high-risk user based on the first risk score and the second risk score includes: performing fusion processing on the first risk score and the second risk score to obtain a third risk score; determining that the first user belongs to a high-risk user if the third risk score is higher than a first predetermined value; otherwise, determining that the first user does not belong to the high-risk user.
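The fusion and threshold step can be illustrated with a minimal sketch. The source does not specify the fusion function at this point, so the weighted average below, the `model_weight` parameter, and the default threshold of 0.7 are all assumptions chosen only for illustration; both scores are assumed to lie in [0, 1].

```python
def fuse_scores(model_score: float, rules_score: float, model_weight: float = 0.5) -> float:
    """Blend the first (model) and second (rule) risk scores into a third score.

    A weighted average is an assumed fusion method, not one stated in the source.
    """
    if not 0.0 <= model_weight <= 1.0:
        raise ValueError("model_weight must be between 0 and 1")
    return model_weight * model_score + (1.0 - model_weight) * rules_score


def is_high_risk(model_score: float, rules_score: float,
                 threshold: float = 0.7, model_weight: float = 0.5) -> bool:
    """Return True when the fused score is strictly higher than the threshold.

    `threshold` plays the role of the "first predetermined value"; 0.7 is hypothetical.
    """
    return fuse_scores(model_score, rules_score, model_weight) > threshold
```

With these assumed defaults, `is_high_risk(0.9, 0.8)` flags the user, while `is_high_risk(0.2, 0.4)` does not.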
Optionally, in the method for detecting a high-risk user according to the present invention, the second risk score is calculated by the following formula:
[Formula image not reproduced in the source text: Figure BDA0003592894270000021]
wherein rules_score(x_i) is the second risk score of user x_i; r_j(x_i) is the risk score of the j-th expert rule for user x_i; and m is the total number of rules.
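As a rough illustration of how per-rule scores r_j(x_i) might be combined into a second risk score, the sketch below evaluates m rules against a user's campaign behavior. The formula itself is an image that did not survive in the text, so averaging the rule scores is an assumption, and both example rules, their thresholds, and the behavior field names are hypothetical.

```python
from typing import Callable, Dict, List

Behavior = Dict[str, float]
Rule = Callable[[Behavior], float]


def too_many_claims(behavior: Behavior) -> float:
    # Hypothetical rule: more than 10 coupon claims per hour is suspicious.
    return 1.0 if behavior.get("claims_per_hour", 0) > 10 else 0.0


def shared_device(behavior: Behavior) -> float:
    # Hypothetical rule: more than 3 accounts on one device is suspicious.
    return 1.0 if behavior.get("accounts_on_device", 0) > 3 else 0.0


def rules_score(behavior: Behavior, rules: List[Rule]) -> float:
    """Average the per-rule scores over all m rules (averaging is an assumption)."""
    if not rules:
        return 0.0
    return sum(rule(behavior) for rule in rules) / len(rules)
```

For example, a user with 12 claims per hour on a device used by a single account triggers one of the two rules and receives a score of 0.5 under this assumed aggregation.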
Optionally, in the detection method for the high-risk user according to the present invention, the trained detection model is generated by: acquiring a second data sample set, wherein the second data sample set comprises second data samples of a plurality of users, and the second data samples comprise registration behavior characteristics and login behavior characteristics of the plurality of users and whether the users belong to high-risk users or not; and training the detection model through the second data sample set until a preset condition is reached to obtain the trained detection model.
Optionally, in the detection method of the high-risk user according to the present invention, the predetermined condition is that a value of a loss function between a predicted value and a true value of the detection model is minimum or an accuracy of the detection model is not increased any more.
Optionally, in the detection method of the high-risk user according to the present invention, before the step of training the detection model by the second data sample set, the method further includes: the second set of data samples is preprocessed.
Optionally, in the method for detecting a high-risk user according to the present invention, the step of preprocessing the second data sample set includes: discarding the second data sample of a user in the second data sample set if the second data sample of the user has a missing item and the number of the missing item exceeds a fourth predetermined value of the total number of data included in the second data sample; if the second data sample of one user in the second data sample set has a missing entry, but the number of the missing entries does not exceed a fourth predetermined value of the total number of data included in the second data sample, the missing entry of the second data sample of the user is filled.
Optionally, in the method for detecting a high-risk user according to the present invention, the step of filling missing items of the second data sample of the user includes: if the missing item in the second data sample of the user belongs to the continuous variable in the second data sample set, acquiring all values corresponding to the missing item from the second data sample set, performing mean calculation on all the values, and filling the missing item in the second data sample of the user by adopting a mean calculation result; and if the missing item in the second data sample of the user belongs to the discrete variable in the second data sample set, acquiring all values corresponding to the missing item from the second data sample set, and filling the missing item of the second data sample of the user by adopting a mode in all the values.
Optionally, in the method for detecting a high-risk user according to the present invention, the registration behavior feature and the login behavior feature include at least one of a registration IP attribution, a registration time, a registration nickname, a login IP attribution, a login time, and a login device.
According to another aspect of the invention, there is provided a computing device comprising: one or more processors; and a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the above-described high-risk user detection method.
According to yet another aspect of the present invention, there is also provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the above-described method of detecting a high-risk user.
According to the technical scheme of the invention, the trained detection model has the characteristic of strong generalization capability, and the method for determining the risk score through the expert rules has the characteristics of high accuracy and strong pertinence. The invention determines whether the first user is a high-risk user or not based on the first risk score and the second risk score, combines the advantages of a detection model and expert rules, and provides a stable and reliable detection method for the high-risk user.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a computing device 100, according to one embodiment of the invention;
FIG. 2 shows a flow diagram of a method 200 of detection of a high risk user according to one embodiment of the invention; and
FIG. 3 shows a schematic diagram of training a detection model according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The method for detecting a high-risk user of the present invention is performed in a computing device. The computing device may be any device with storage and computing capabilities, and may be implemented as, for example, a server, a workstation, or the like, or may be implemented as a personal computer such as a desktop computer or a notebook computer, or may be implemented as a terminal device such as a mobile phone, a tablet computer, a smart wearable device, or an internet of things device, but is not limited thereto.
FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention. It should be noted that the computing device 100 shown in fig. 1 is only an example. In practice, the computing device implementing the detection method of the present invention may be any type of device, and its hardware configuration may be the same as or different from that of the computing device 100 shown in fig. 1. Hardware components may be added to or removed from the computing device 100 shown in fig. 1, and the present invention is not limited to a specific hardware configuration of the computing device.
As shown in FIG. 1, in a basic configuration 102, a computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The physical memory of a computing device usually refers to volatile memory (RAM), and data on disk must be loaded into physical memory before the processor 104 can read it. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some implementations, the application 122 can be arranged to have its instructions executed on the operating system by the one or more processors 104 using the program data 124. Operating system 120 may be, for example, Linux, Windows, etc., which includes program instructions for handling basic system services and performing hardware dependent tasks. The application 122 includes program instructions for implementing various user-desired functions, and the application 122 may be, for example, but not limited to, a browser, instant messenger, a software development tool (e.g., an integrated development environment IDE, a compiler, etc.), and the like. When the application 122 is installed into the computing device 100, a driver module may be added to the operating system 120.
When the computing device 100 is started, the processor 104 reads program instructions of the operating system 120 from the memory 106 and executes them. The application 122 runs on top of the operating system 120, utilizing the operating system 120 and interfaces provided by the underlying hardware to implement various user-desired functions. When the user starts the application 122, the application 122 is loaded into the memory 106, and the processor 104 reads the program instructions of the application 122 from the memory 106 and executes the program instructions.
The computing device 100 also includes a storage device 132, the storage device 132 including removable storage 136 and non-removable storage 138, the removable storage 136 and the non-removable storage 138 each connected to the storage interface bus 134.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media. In the computing device 100 according to the invention, the program data 124 comprises instructions for performing the method for detecting a high-risk user of the invention.
Internet marketing activities vary in how long they run and in how they are played, so the data collected by data analysts is varied and uneven. As time goes on, each platform runs more marketing activities of more types, and maintaining hundreds or thousands of activities becomes the norm for a platform. How to identify high-risk users efficiently is therefore an important subject for risk control personnel.
The existing rule-based risk scoring method generally relies on rules derived from expert experience, which can be divided into single rules and combined rules: a single rule reaches a judgment using one rule, whereas a combined rule requires several conditions or rules to be judged together. However, risk scoring techniques that use only rules lag behind, because the rules are determined by experts from business experience, and they are costly to maintain, because the rules must be iterated continually as the business changes. Moreover, in the face of complex business and massive high-dimensional features, even a senior expert can hardly ensure that the rules cover all risk problems.
The existing risk scoring method based on artificial intelligence applies techniques such as machine learning and deep learning to perform feature processing and model training on business scenario data, and assesses risk according to the model results. A typical business model performs feature engineering for one specific scenario and then trains a model. User-acquisition marketing activities, however, vary in duration and take many forms, and the data collected by data analysts is varied and uneven, so features common to all activities are difficult to extract. Because activities may be short-lived and are of many kinds, training a separate model to identify risky users for each activity is inefficient.
Therefore, the invention creatively uses user registration and login data for the model's feature engineering and trains the model on it, which addresses the diversity of marketing activities and allows one model to cover all of them. In addition to training a general anti-cheating model for marketing activities from registration and login data, the invention also integrates expert rules that target the behavior characteristics of users in the target marketing activity, thereby achieving risk control tailored to each marketing activity.
Fig. 2 shows a flow diagram of a method 200 of detection of a high risk user according to one embodiment of the invention. Method 200 is suitable for execution in a computing device, such as computing device 100 described above. As shown in fig. 2, the method 200 begins at step S210.
In step S210, a first data sample of the first user is obtained. The first data sample includes a registration behavior feature and a login behavior feature of the first user, which include at least one of a registration IP attribution, a registration time, a login IP attribution, a login time, and a login device, and may also include any one or more of a registration nickname, an IP address, a mobile phone number attribution, a login IP address, and the like.
Log data about user registration and login are usually recorded in detail by an internet platform (e.g., an automobile media platform), and behavior characteristics of a user in a registration phase, such as a registration IP, a registration duration, a registration nickname, and the like, can be obtained through logs, and behavior characteristics of the user in a login phase, such as a login IP, a login duration, a login device (e.g., a PC, a mobile phone, H5), and the like, can also be obtained through logs.
Subsequently, in step S220, the first data sample is input into the trained detection model, and a probability value that predicts that the first user belongs to the high-risk user is output as the first risk score.
Next, how to train the detection model will be explained.
First, a second data sample set is obtained. The second data sample set includes second data samples of multiple users, and each second data sample includes the registration behavior feature and the login behavior feature of one user, together with a label indicating whether the user is a high-risk user: the label may be 1 if the user is a high-risk user and 0 otherwise. The registration behavior feature and the login behavior feature include at least one of a registration IP attribution, a registration time, a login IP attribution, a login time, and a login device, and may also include any one or more of a registration nickname, an IP address, a mobile phone number attribution, a login IP address, and the like. Here, the registration time (registration duration) is the time taken by the user to complete the registration process, and the login time (login duration) is the time taken by the user to complete the login. Table 1 gives an exemplary second data sample set:
Table 1:
[Table images not reproduced in the source text: Figure BDA0003592894270000081, Figure BDA0003592894270000091]
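Since the table itself is not available in the text, the shape of one labeled training sample can be sketched as a small record type. The field names and example values below are hypothetical and are not taken from the table; `None` marks a missing item, and the label follows the 1/0 convention described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledSample:
    """One labeled training sample; all field names are illustrative assumptions."""
    user_id: str
    registration_ip_location: Optional[str]
    registration_duration_s: Optional[float]
    login_ip_location: Optional[str]
    login_duration_s: Optional[float]
    login_device: Optional[str]
    is_high_risk: int  # label: 1 = high-risk user, 0 = not a high-risk user

    def missing_count(self) -> int:
        """Count the feature items that are missing (None); the label is excluded."""
        features = (
            self.registration_ip_location,
            self.registration_duration_s,
            self.login_ip_location,
            self.login_duration_s,
            self.login_device,
        )
        return sum(value is None for value in features)


# Hypothetical sample whose login duration was not recorded.
example = LabeledSample("user-3", "Beijing", 12.0, "Beijing", None, "PC", 0)
```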
The invention trains the detection model on registration and login data collected from a large number of users and does not use data tied to any specific marketing activity, so the prediction output by the detection model depends only on registration and login characteristics. The trained detection model can therefore be used across marketing activities rather than being limited to a specific one: it can score the risk a user may pose in different marketing activities without a separate model being trained for each. This improves the generalization of the model and the efficiency of risk detection.
Then, after the second data sample set is acquired, preprocessing of data is performed on second data samples of a plurality of users included in the second data sample set.
Regarding the preprocessing of the data, the screened registration behavior features and login behavior features can be subjected to data processing according to the sparsity degree of the data, including processing of abnormal values and missing values. Specifically, if the second data sample of one user in the second data sample set has a missing item, and the number of the missing items exceeds a fourth predetermined value of the total number of data included in the second data sample, the second data sample of the user is discarded.
If the second data sample of one user in the second data sample set has missing items, but the number of missing items does not exceed a fourth predetermined value of the total number of data items included in the second data sample, the missing items of that user's second data sample are filled. Here, the fourth predetermined value may be set and adjusted by those skilled in the art according to the data completeness of the collected second data sample set. For example, the fourth predetermined value may be set to 50%: when more than 50% of the data items in a second data sample are missing, the sample is discarded, and when the missing items amount to less than 50% of the total data items in the sample, they are filled. The more complete the collected second data sample set, the lower the fourth predetermined value may be set. Of course, those skilled in the art may set the fourth predetermined value according to other criteria, which is not limited by the present invention.
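The discard-or-fill decision can be sketched as follows, using the 50% example value for the fourth predetermined value. Representing a sample as a dictionary with `None` for missing items, and the helper names, are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Sample = Dict[str, Optional[object]]


def missing_ratio(sample: Sample) -> float:
    """Fraction of items in the sample that are missing (None)."""
    if not sample:
        return 0.0
    return sum(value is None for value in sample.values()) / len(sample)


def split_by_missingness(samples: List[Sample],
                         max_missing_ratio: float = 0.5) -> Tuple[List[Sample], List[Sample]]:
    """Return (kept, dropped): a sample is dropped only when its missing ratio
    exceeds the limit; samples at or below the limit are kept for filling."""
    kept: List[Sample] = []
    dropped: List[Sample] = []
    for sample in samples:
        if missing_ratio(sample) > max_missing_ratio:
            dropped.append(sample)
        else:
            kept.append(sample)
    return kept, dropped
```

A sample with three of four items missing is dropped under the 50% limit, whereas one with exactly two of four missing is kept and passed on to the filling step.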
Optionally, if a missing item in the second data sample of a user needs to be filled, it can be filled in the following manner. First, it is determined whether the missing item is a continuous variable or a discrete variable. If the missing item in the second data sample of the user is a continuous variable in the second data sample set, all values corresponding to that item are acquired from the second data sample set, their mean is calculated, and the missing item of the user's second data sample is filled with the mean. For example, in Table 1 above, the second data sample of user 3 has a missing item (login duration). If this item is a continuous variable in the second data sample set, the mean (2.5 seconds) of the login durations of the other users (3 seconds for user 1 and 2 seconds for user 2) is used, and the login duration of user 3 is filled in as 2.5 seconds.
If the missing item in the second data sample of the user is a discrete variable in the second data sample set, all values corresponding to that item are acquired from the second data sample set, and the missing item of the user's second data sample is filled with the mode of those values. For example, in Table 1 above, the second data sample of user 3 has a missing item (login duration). Assuming that this item is a discrete variable in the second data sample set, the mode (2 seconds) of the login durations of the other users (3 seconds for user 1, 2 seconds for user 2, and, by assumption, 2 seconds for a further user 4) is used, and the login duration of user 3 is filled in as 2 seconds.
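The discard-or-fill procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the function name `preprocess`, the constant `MAX_MISSING_FRACTION`, the use of `None` to mark a missing item, and the field names in the usage note are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean

# Hypothetical name for the "fourth predetermined value" (50% in the example above).
MAX_MISSING_FRACTION = 0.5


def preprocess(samples, continuous_fields, discrete_fields):
    """Discard samples with too many missing items, then fill the remaining gaps.

    `samples` is a list of dicts, one per user; a missing item is None (assumption).
    """
    fields = list(continuous_fields) + list(discrete_fields)
    if not fields:
        return [dict(s) for s in samples]

    # Fill values are computed from all observed values in the sample set:
    # mean for continuous variables, mode for discrete variables.
    fill = {}
    for f in fields:
        observed = [s[f] for s in samples if s.get(f) is not None]
        if not observed:
            continue  # nothing to derive a fill value from
        if f in continuous_fields:
            fill[f] = mean(observed)
        else:
            fill[f] = Counter(observed).most_common(1)[0][0]

    result = []
    for s in samples:
        missing = sum(1 for f in fields if s.get(f) is None)
        if missing / len(fields) > MAX_MISSING_FRACTION:
            continue  # too many missing items: discard the sample
        filled = dict(s)
        for f, value in fill.items():
            if filled.get(f) is None:
                filled[f] = value
        result.append(filled)
    return result
```

With three users whose login durations are 3, 2, and missing, and login duration treated as continuous, the third user's value is filled with 2.5, matching the Table 1 example.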
The data processing applied to the second data sample set may further include feature derivation, data segmentation, data encoding, and the like. Feature derivation and data segmentation are data processing techniques specific to machine learning. Feature derivation can fully take into account the interactions among factors and derives, from the existing features, user behavior factors that are stronger than the original weak factors.
Then, after preprocessing has been completed for the second data samples of the plurality of users included in the second data sample set, the detection model is trained on the processed second data sample set.
The processed second data sample set is divided into a training set and a validation set according to an agreed proportion. For example, 80% of the second data samples in the second data sample set are used as the training set and the remaining 20% as the validation set; this proportion may be adjusted as appropriate, which is not limited by the present invention. The training set can be modeled with the LightGBM algorithm, a well-established tree-model algorithm. Because the performance of the LightGBM algorithm on structured data is not inferior to that of a deep learning model and its training speed is much faster than that of the XGBoost algorithm, the invention preferably trains a binary classification model using the LightGBM algorithm. The output of the model is the probability that the first user is a high-risk user, a value between 0 and 1.
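The 80/20 division can be illustrated with a short sketch. The function name `split_train_validation`, the shuffling step, and the `seed` parameter are assumptions for the example; the description only states the proportion.

```python
import random


def split_train_validation(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into a training set and a validation set."""
    shuffled = list(samples)
    # Shuffling before the split is an assumption; a fixed seed keeps it repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The training set returned here would then be passed to the chosen classifier, and the validation set would be used to monitor the loss or accuracy during training.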
FIG. 3 shows a schematic diagram of training a detection model according to one embodiment of the invention. As shown in FIG. 3, X1, X2, X3 … Xn are the second data samples of the second data sample set used for training the detection model. Each second data sample corresponds to one user and contains that user's feature values, so X1, X2, X3 … Xn represent the second data samples of different users. The detection model is trained until a preset condition is reached, yielding the trained detection model, where the preset condition is that the value of the loss function between the predicted value and the true value of the detection model is minimized or the accuracy of the detection model no longer improves. Then, a new data sample (denoted here by Xn+1, for example a first data sample of a first user) is input into the trained detection model. The trained detection model outputs the predicted probability that the first user is a high-risk user as the first risk score. For the detection model, in addition to a machine learning model, deep learning modeling may also be employed, for example the DeepFM algorithm, but is not limited thereto. The tree model is selected because it offers strong interpretability while still performing well.
A general business model extracts data and performs feature engineering only for the current business link. Internet marketing activities, however, are diverse and vary in duration. The detection method for high-risk users of the present invention therefore innovatively extracts features from the user registration and login links (before the user participates in a marketing activity) to train the detection model. In this way, the problem of finding features common to different marketing activities is solved, and, because registration and login take place before the user participates in an activity, the problem of user concurrency in the marketing activity link is alleviated.
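One way to check the preset condition that the accuracy no longer improves is a patience-based test on the validation accuracy recorded after each training round. The sketch below is an assumption about how such a check could look; `should_stop`, `patience`, and `min_delta` are hypothetical names and are not taken from the description.

```python
def should_stop(accuracy_history, patience=3, min_delta=0.0):
    """Return True when accuracy has not improved during the last `patience` rounds."""
    if len(accuracy_history) <= patience:
        return False  # not enough rounds recorded to decide
    best_before = max(accuracy_history[:-patience])
    recent_best = max(accuracy_history[-patience:])
    # Stop when the recent rounds did not beat the earlier best by more than min_delta.
    return recent_best <= best_before + min_delta
```

A training loop would append the validation accuracy after each round and end training once `should_stop` returns True.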
Subsequently, in step S230, expert rules of the targeted marketing campaign and the behavior characteristics of the first user in the targeted marketing campaign are acquired.
According to an embodiment of the invention, the trained detection model and the expert rules each produce a score separately. As for scoring by expert rules, the dimension data collected when the user participates in a specific target marketing activity is used as the input to the expert rules, so the score obtained from the expert rules is tied to the current marketing activity. For different marketing activities, expert rules specific to the target marketing activity are adopted, and the behavior characteristics of the first user in that target marketing activity are collected.
Subsequently, in step S240, the behavior characteristics of the first user in the targeted marketing campaign are evaluated based on the obtained expert rules, and a second risk score is obtained.
According to the embodiment of the invention, after a series of risk control strategies and matching rules has been formed based on expert experience, the acquired behavior characteristics of the first user in the target marketing activity are given a risk score under each expert rule, and if there are a plurality of expert rules, the initial risk scores obtained under the individual expert rules are accumulated. The accumulated score is then subjected to a logarithmic transformation and a Sigmoid transformation, which map it to the [0,1] interval. Specifically, the second risk score may be calculated by the following formula:
Figure BDA0003592894270000121
where rules_score(x_i) denotes the second risk score of user x_i; r_j(x_i) denotes the risk score of user x_i under the j-th expert rule; and m is the total number of rules.
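The formula image is not reproduced in this text, so its exact form cannot be confirmed here. As a sketch only, one literal reading of the description (accumulate the m rule scores, apply a logarithm, then apply the Sigmoid function) is shown below. Treating a non-positive sum as a score of 0 is an assumption added so that the logarithm is always defined; the published formula may differ, for example by an offset or scaling.

```python
import math


def rules_score(rule_scores):
    """Map the accumulated rule scores r_1..r_m into [0, 1) via log and Sigmoid.

    Assumed form: sigmoid(ln(sum_j r_j)); this is not confirmed by the source text.
    """
    total = sum(rule_scores)
    if total <= 0:
        return 0.0  # assumption: no rule score accumulated means no rule-based risk
    log_total = math.log(total)
    return 1.0 / (1.0 + math.exp(-log_total))
```

Under this reading the result equals total / (1 + total), so larger accumulated scores approach 1 without reaching it.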
According to the embodiment of the invention, expert rules specific to the target marketing activity are adopted for different marketing activities, and the behavior characteristics of the first user in the target marketing activity are collected. Examples of expert rules are: rule 1, whether the number of users associated with the IP address of the user participating in the marketing campaign is greater than 50; rule 2, whether the number of users on the same device as the user participating in the marketing campaign is greater than 3; rule 3, whether the number of microblogs forwarded within 1 second by the user participating in the marketing campaign is greater than 3; rule 4, whether the number of users brought in within 1 second by the user participating in the marketing campaign is greater than 3. Optionally, a fixed risk score is set for each expert rule according to the marketing campaign, for example 3 points for rule 1, 5 points for rule 2, and 4 points for rule 3. For example, if the number of users associated with the IP address of the user participating in the marketing campaign is greater than 50, the risk score of rule 1 is 3 points; if that number is not greater than 50, the 3 points of rule 1 are not assigned. The score can be set according to the importance of the rule: the more important the rule, the larger the risk score that can be set.
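The threshold rules with fixed scores can be sketched as a small table plus an evaluation function. The feature names, the `RULES` table, and the choice to score an untriggered rule as 0 are assumptions for illustration; the thresholds and point values mirror rules 1 to 3 in the example above.

```python
# (hypothetical feature name, threshold, fixed risk score), mirroring rules 1 to 3
RULES = [
    ("users_on_same_ip", 50, 3),
    ("users_on_same_device", 3, 5),
    ("forwards_within_1s", 3, 4),
]


def score_rules(behavior, rules=RULES):
    """Return the per-rule risk scores r_1..r_m for one user's behavior features.

    A rule contributes its fixed score when the feature exceeds the threshold,
    and 0 otherwise (the 0 is an assumption; the text does not state the value).
    """
    scores = []
    for feature, threshold, points in rules:
        value = behavior.get(feature, 0)
        scores.append(points if value > threshold else 0)
    return scores
```

For a user with 80 accounts on the same IP address, 2 accounts on the same device, and 5 forwards within 1 second, the per-rule scores would be 3, 0, and 4.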
Optionally, the behavior characteristics of the user in the marketing campaign include, but are not limited to, the IP address used by the user when participating in the marketing campaign, the unique device identifier of the device used by the user when participating in the marketing campaign, the number of times the user forwards content (the content may include microblogs, official account articles, marketing messages, or the like) within a predetermined time while participating in the marketing campaign, the number of users the user brings in within a predetermined time while participating in the marketing campaign, and the number of friends the user brings in within a predetermined time while participating in the marketing campaign. Here, the predetermined time may be 1 second, 1 minute, or 1 hour, but is not limited thereto, and may be set by those skilled in the art. For example, for a marketing campaign in which users forward a microblog to enter a lottery, the behavior characteristics of the first user in the target marketing campaign may be selected as the IP address used when the user participates in the marketing campaign, the unique device identifier of the device used when the user participates in the marketing campaign, and the number of times the user forwards a microblog within 1 second while participating in the marketing campaign. For this marketing campaign, the expert rules corresponding to the marketing campaign (e.g., the aforementioned rules 1 to 3) are obtained. Then, based on the risk score obtained by the first user under each rule, the second risk score is calculated through the formula above.
Subsequently, in step S250, it is determined whether the first user is a high risk user based on the first risk score and the second risk score.
Specifically, the first risk score and the second risk score are subjected to fusion processing to obtain a third risk score. The fusion process may be a weighted summation of the first risk score and the second risk score to obtain a third risk score as a final risk score. After the first risk score and the second risk score of the first user are obtained according to the foregoing steps S210 to S240, a third risk score may be generated by the following formula:
final_score(x_i) = a * model_score(x_i) + b * rules_score(x_i)
a + b = 1
where final_score(x_i) denotes the third (final) risk score of user x_i, model_score(x_i) denotes the first risk score of user x_i, rules_score(x_i) denotes the second risk score of user x_i, a is the weight of the trained detection model, and b is the weight of the expert rules. The weights a and b can be flexibly adjusted for different marketing activities so as to achieve personalized scoring for each marketing activity.
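The weighted fusion can be written directly from the formula. In this sketch the function name `fuse_scores` is an assumption, and b is derived from a so that the constraint a + b = 1 always holds.

```python
def fuse_scores(model_score, rules_score, a=0.5):
    """Compute the third risk score as a * model_score + b * rules_score, with b = 1 - a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("weight a must lie in [0, 1]")
    b = 1.0 - a  # enforces a + b = 1
    return a * model_score + b * rules_score
```

Because both inputs lie in [0, 1] and the weights sum to 1, the fused score also lies in [0, 1].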
The setting of the weights may take into account the importance of the targeted marketing campaign, the interpretability of the detection model or the expert rules, and the accuracy. For marketing activities in which more information is collected, the expert rules are more numerous and more refined, and higher interpretability is needed; in this case the weight b is preferably set larger than the weight a. For example, where a and b would normally both be 0.5, b can be set to 0.6 and a to 0.4. The more comprehensively the behavior characteristics of the first user in the targeted marketing campaign are collected, the higher the weight of the expert rules. Whether the information collection is complete may be determined according to whether the acquired behavior characteristics of the first user in the targeted marketing activity have missing values: if the proportion of missing values among all behavior characteristics exceeds a predetermined value, the information collection is deemed incomplete. The predetermined value may be set by a person skilled in the art, for example to 20%, but is not limited thereto.
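As an illustration of how the completeness check could drive the weights, the sketch below returns (a, b) = (0.4, 0.6) when the collected behavior characteristics are complete and (0.5, 0.5) otherwise. The function name `choose_weights`, the use of `None` for a missing value, and the mapping from completeness to these two weight pairs are assumptions built from the example figures in the text.

```python
def choose_weights(behavior, expected_features, max_missing_fraction=0.2):
    """Pick (a, b) from the share of missing behavior characteristics.

    Returns (0.4, 0.6) when collection is complete, else the default (0.5, 0.5).
    """
    if not expected_features:
        return 0.5, 0.5
    missing = sum(1 for f in expected_features if behavior.get(f) is None)
    if missing / len(expected_features) > max_missing_fraction:
        return 0.5, 0.5  # collection deemed incomplete: keep the default weights
    return 0.4, 0.6  # collection complete: weight the expert rules more heavily
```

In practice the weights would also reflect the importance of the campaign and the required interpretability, which this sketch does not model.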
Optionally, after the third risk score of the first user is determined, thresholds are set for the fused score of the detection model and the expert rules, and a risk rating is output. Specifically, the risk rating for the third risk score is output using the 3σ criterion. In a normal distribution, σ denotes the standard deviation and μ denotes the mean, and x = μ is the axis of symmetry of the curve. The 3σ principle states that the probability of a value falling in (μ - σ, μ + σ) is 0.6826, the probability of a value falling in (μ - 2σ, μ + 2σ) is 0.9544, and the probability of a value falling in (μ - 3σ, μ + 3σ) is 0.9974. The values of the third risk score can therefore be considered to be almost entirely concentrated in the interval (μ - 3σ, μ + 3σ), with a probability of less than 0.3% of falling outside this range. Optionally, the third risk scores of a plurality of first users are collected, and the risk rating of the third risk score of a first user is determined according to the 3σ criterion. If the third risk scores follow a normal distribution, the thresholds of the risk rating may be set to μ - σ, μ, and μ + σ, which divide the scores into four levels, where μ is the mean of all the third risk scores and σ is their standard deviation: users with a third risk score between 0 and μ - σ have the lowest risk, users with a third risk score between μ - σ and μ have lower risk, users with a third risk score between μ and μ + σ have higher risk, and users with a third risk score between μ + σ and 1 have the highest risk.
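The four-level rating based on μ - σ, μ, and μ + σ can be sketched as follows. The function name `risk_levels`, the level labels, the use of the population standard deviation, and the assignment of a score lying exactly on a threshold to the upper level are assumptions; the text does not specify them.

```python
from statistics import mean, pstdev


def risk_levels(third_risk_scores):
    """Rate each third risk score against the thresholds mu - sigma, mu, mu + sigma."""
    mu = mean(third_risk_scores)
    sigma = pstdev(third_risk_scores)  # population standard deviation (assumption)

    def level(score):
        if score < mu - sigma:
            return "lowest"
        if score < mu:
            return "lower"
        if score < mu + sigma:
            return "higher"
        return "highest"

    return [level(s) for s in third_risk_scores]
```

The thresholds are recomputed from whatever batch of scores is supplied, so the rating is relative to the users collected for that campaign.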
Optionally, after determining the third risk score of the first user, it is determined whether the first user belongs to a high risk user according to a relationship between the third risk score and a first predetermined value. In particular, if the third risk score is above a first predetermined value, it may be determined that the first user belongs to a high risk user; otherwise, it may be determined that the first user does not belong to a high-risk user.
The first predetermined value is selected so that the third risk score both achieves high accuracy when identifying high-risk users and identifies a sufficient number of high-risk users, so that the coverage rate is not too low. The first predetermined value should therefore be set neither too large nor too small. Since both the first risk score obtained through the trained detection model and the second risk score obtained through the expert rules are within the [0,1] interval, the final third risk score of the first user also falls within this interval after fusion. In general, 0.5 is taken as the midpoint: a third risk score greater than 0.5 indicates that the first user participating in the targeted marketing campaign is a high-risk user, and a third risk score less than 0.5 indicates that the first user participating in the targeted marketing campaign is a normal user. Optionally, when the targeted marketing campaign is more important and/or the prize is larger, the first predetermined value may be set lower, for example to 0.4, so that a first user with a third risk score of 0.4 or more is determined to be a high-risk user; after high-risk users are determined, the sponsor of the marketing campaign may choose not to allow them to win. If the targeted marketing campaign is less important, the first predetermined value may be set higher, for example so that only a third risk score above 0.8 is considered to indicate a high-risk user.
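The final decision against the first predetermined value can be sketched as below. The importance labels and the `THRESHOLDS` table are hypothetical and simply reuse the example values 0.4, 0.5, and 0.8 from the text. A strict comparison is used here, although the text describes both "above" and "0.4 or more", so the boundary handling is an assumption.

```python
# Hypothetical mapping from campaign importance to the first predetermined value.
THRESHOLDS = {"high": 0.4, "normal": 0.5, "low": 0.8}


def is_high_risk(third_risk_score, importance="normal"):
    """Return True when the third risk score exceeds the campaign's threshold."""
    threshold = THRESHOLDS[importance]
    return third_risk_score > threshold
```

A score of 0.56 would be flagged for a campaign of normal or high importance but not for a campaign of low importance.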
According to the technical scheme of the invention, applying the provided detection method for high-risk users to the various marketing activities of each platform makes it possible to effectively identify high-risk users who may be involved in black and gray market operations, prevent such users from exploiting promotions for profit (so-called "wool pulling"), and thereby protect the effect of the platform's marketing activities. In the marketing activities of internet platforms, black and gray market users usually register or log in a large number of accounts in bulk in order to extract as much benefit as possible. The registration and login characteristics of these accounts usually differ from those of ordinary users and show obvious aggregation, for example a single IP (Internet Protocol) address registering or logging in many accounts. It is therefore reasonable and efficient to identify high-risk users through behavior characteristics in the registration and login stages. The wool-pulling behavior of black and gray market users may differ between marketing campaigns, so with the expert-rule method, high-risk users matching the expert rules can be accurately identified according to the different behaviors or aggregation features in each marketing campaign.
On internet platforms, marketing activities are varied and differ in duration, and the activity rules, activity logic, and user data collected differ from one marketing activity to another. However, users participating in marketing activities all leave registration and login behavior data through the platform's registration/login link, so modeling features from these two links allows black and gray market users with aggregation characteristics to be accurately identified. In addition, different expert rules can accurately identify the black and gray market users of different marketing activities, achieving risk control tailored to each marketing activity. The trained detection model has strong generalization ability, while determining the risk score through expert rules is highly accurate and targeted. The detection method for high-risk users that combines the model and the rules lets the strengths of each compensate for the weaknesses of the other, allows different thresholds to be set flexibly for different marketing activities, and makes the risk scores of users participating in marketing activities more stable and flexible.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the high risk user detection method of the present invention according to instructions in said program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (10)

1. A method of detecting a high risk user, performed in a computing device, the method comprising the steps of:
acquiring a first data sample of a first user, wherein the first data sample comprises registration behavior characteristics and login behavior characteristics of the first user;
inputting the first data sample into a trained detection model, and outputting a probability value for predicting that the first user belongs to a high-risk user as a first risk score;
acquiring expert rules of a target marketing activity and behavior characteristics of a first user in the target marketing activity;
evaluating the behavior characteristics of the first user in the target marketing activity based on the obtained expert rules to obtain a second risk score;
determining whether the first user is a high risk user based on the first risk score and the second risk score.
2. The method of claim 1, wherein the step of determining whether the first user is a high risk user based on the first risk score and the second risk score comprises:
performing fusion processing on the first risk score and the second risk score to obtain a third risk score;
determining that the first user belongs to a high-risk user if the third risk score is above the first predetermined value;
otherwise, determining that the first user does not belong to a high-risk user.
3. The method of claim 1 or 2, wherein the second risk score is calculated by the formula:
Figure FDA0003592894260000011
wherein rules_score(x_i) is the second risk score of user x_i; r_j(x_i) is the risk score of user x_i under the j-th expert rule; and m is the total number of rules.
4. The method of claim 1, wherein the trained detection model is generated by:
acquiring a second data sample set, wherein the second data sample set comprises second data samples of a plurality of users, and the second data samples comprise registration behavior characteristics and login behavior characteristics of the plurality of users and whether the users belong to high-risk users or not;
and training the detection model through the second data sample set until a preset condition is reached to obtain the trained detection model.
5. The method according to claim 4, wherein the predetermined condition is that the value of the loss function between the predicted value and the true value of the detection model is minimal or the accuracy of the detection model is no longer improved.
6. The method of claim 4 or 5, wherein prior to the step of training a detection model by the second set of data samples, further comprising:
the second set of data samples is preprocessed.
7. The method of claim 6, wherein the step of preprocessing the second set of data samples comprises:
discarding the second data sample of a user in the second data sample set if the second data sample of the user has a missing item and the number of the missing item exceeds a fourth predetermined value of the total number of data included in the second data sample;
and if the second data sample of one user in the second data sample set has the missing item, but the number of the missing item does not exceed a fourth preset value of the total data included in the second data sample, filling the missing item of the second data sample of the user.
8. The method of claim 7, wherein the step of populating the missing entries of the second data sample of the user comprises:
if the missing item in the second data sample of the user belongs to the continuous variable in the second data sample set, all values corresponding to the missing item are obtained from the second data sample set, mean value calculation is carried out on all the values, and the missing item of the second data sample of the user is filled by adopting a mean value calculation result;
and if the missing item in the second data sample of the user belongs to the discrete variable in the second data sample set, acquiring all values corresponding to the missing item from the second data sample set, and filling the missing item of the second data sample of the user by adopting a mode in all the values.
9. A computing device, comprising:
one or more processors; and
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-8.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
CN202210383838.8A 2022-04-12 2022-04-12 Detection method of high-risk user, computing device and readable storage medium Pending CN114841526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383838.8A CN114841526A (en) 2022-04-12 2022-04-12 Detection method of high-risk user, computing device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210383838.8A CN114841526A (en) 2022-04-12 2022-04-12 Detection method of high-risk user, computing device and readable storage medium

Publications (1)

Publication Number Publication Date
CN114841526A true CN114841526A (en) 2022-08-02

Family

ID=82564507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383838.8A Pending CN114841526A (en) 2022-04-12 2022-04-12 Detection method of high-risk user, computing device and readable storage medium

Country Status (1)

Country Link
CN (1) CN114841526A (en)

CN108460049A (en) A kind of method and system of determining information category
CN116865994A (en) Network data security prediction method based on big data
CN113409096B (en) Target object identification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination