CN114399314A - User detection method, device, equipment and medium - Google Patents


Info

Publication number
CN114399314A
CN114399314A (application number CN202210015330.2A)
Authority
CN
China
Prior art keywords
user
behavior
sequence
self
model
Prior art date
Legal status (an assumption, not a legal conclusion)
Pending
Application number
CN202210015330.2A
Other languages
Chinese (zh)
Inventor
王硕
姜娜
杨康
孙泽懿
Current Assignee (the listed assignees may be inaccurate)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (an assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN202210015330.2A priority Critical patent/CN114399314A/en
Publication of CN114399314A publication Critical patent/CN114399314A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/018Certifying business or products
    • G06Q30/0185Product, service or business identity fraud
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0242Determining effectiveness of advertisements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • G06Q30/0269Targeted advertisements based on user profile or attribute

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a user detection method, apparatus, device, and medium. The detection method comprises the following steps: constructing a time-series behavior sequence for each user based on behavior data of a plurality of users over a preset time period, wherein the behavior data is the behavior information each user generates on an advertisement promotion platform, and the time-series behavior sequence characterizes the temporal features of each user's behavior; inputting each user's time-series behavior sequence into a preset self-coding model to extract the depth features of the sequence; and inputting all the depth features into a preset detection model to detect the plurality of users, determining the users in the detection result that satisfy the isolation condition as target users with fraudulent behavior. Because the detection data are constructed from multiple dimensions according to the temporal dependency of user behavior, no manual labeling is required, which not only reduces labor consumption but also makes detection more accurate.

Description

User detection method, device, equipment and medium
Technical Field
The present application relates to the field of marketing intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for user detection.
Background
From the history of advertising, the industry has advanced along with media and technology. As the media industry has developed, more and more methods of advertisement promotion have been derived, and the internet offers advertisers many promotion platforms on which to publicize their products. But as the number of promoted advertisements grows, the advertisement promotion market has become a mixed bag, and some users even exploit advertisements to commit fraud. Detecting fraudulent users within large-scale data has therefore become increasingly important. Several detection methods already exist, for example analyzing whether traffic is abnormal from users' click and exposure behavior, but because advertising fraud takes many and constantly changing forms, it is difficult to detect abnormal traffic accurately.
No effective solution has yet been proposed for the problem that abnormal traffic is difficult to detect accurately because advertising fraud takes many and constantly changing forms.
Disclosure of Invention
The application provides a user detection method, apparatus, device, and medium to solve the technical problem that abnormal traffic is difficult to detect accurately because advertising fraud takes many and constantly changing forms.
According to an aspect of an embodiment of the present application, there is provided a user detection method, including: constructing a time-series behavior sequence for each user based on behavior data of a plurality of users over a preset time period, wherein the behavior data is the behavior information each user generates on an advertisement promotion platform, and the time-series behavior sequence characterizes the temporal features of each user's behavior; inputting each user's time-series behavior sequence into a preset self-coding model to extract the depth features of the sequence; and inputting all the depth features into a preset detection model to detect the plurality of users, determining the users in the detection result that satisfy the isolation condition as target users with fraudulent behavior.
Optionally, constructing a time-series behavior sequence corresponding to each user based on behavior data of a plurality of users in a preset time period includes: acquiring behavior data of a plurality of users according to a preset sampling interval in a preset time period; establishing a time relation mapping table of each user and respective behavior data according to the sequence of the occurrence time by taking the user identification of each user as an index; and determining a behavior field and matching the behavior data of each user according to the behavior field based on the time relation mapping table to obtain a time sequence behavior sequence of a plurality of users.
Optionally, determining a behavior field and matching the behavior data of each user according to the behavior field based on the time relation mapping table includes processing a missing value as follows: selecting a preset missing value for filling under the condition that the number of the missing values in the time sequence behavior sequence of each user does not exceed the missing value threshold; and directly deleting the time sequence behavior sequence of the user to which the missing value belongs under the condition that the number of the missing values in the time sequence behavior sequence of each user exceeds the missing value threshold value.
Optionally, the method further comprises training the preset self-coding model as follows: acquiring a sample set, wherein the sample set comprises a training set and a verification set, and the training set and the verification set respectively comprise a plurality of time sequence behavior sequence samples; inputting a training set into an initial self-coding model for training to obtain an intermediate self-coding model, wherein the initial self-coding model comprises an initial coding module and an initial decoding module; verifying the intermediate self-coding model by using a verification set, and determining the intermediate self-coding model as a self-encoder model when the verification result of the intermediate self-coding model on the verification set indicates that the feature identification accuracy of the intermediate self-coding model reaches a target threshold value; and when the verification result of the intermediate self-coding model to the verification set indicates that the feature recognition accuracy of the intermediate self-coding model does not reach the target threshold value, continuing to use the verification set to train the intermediate self-coding model until the feature recognition accuracy of the intermediate self-coding model reaches the target threshold value, and determining the intermediate model as a self-encoder model.
Optionally, inputting the training set into the initial self-coding model for training to obtain an intermediate self-coding model includes: inputting a first time-series behavior sequence in the training set into a first fully connected layer to obtain a first output vector; inputting the first output vector into an initial encoder for encoding to obtain a first hidden-layer state, wherein the initial coding module comprises the first fully connected layer and the initial encoder; inputting the first hidden-layer state into an initial decoder for decoding to obtain a second hidden-layer state; inputting the second hidden-layer state into a second fully connected layer to map it into a second time-series behavior sequence, wherein the initial decoding module comprises the initial decoder and the second fully connected layer; determining a loss value of the objective function from the difference between the first time-series behavior sequence and the second time-series behavior sequence; and adjusting the network parameters in the first fully connected layer, the initial encoder, the initial decoder, and the second fully connected layer according to the loss value to obtain the intermediate self-coding model.
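A minimal sketch of this training step in pure Python, assuming a toy linear autoencoder: the first fully connected layer plus encoder are collapsed into one linear map W1 (the coding module), and the decoder plus second fully connected layer into one linear map W2 (the decoding module); the loss is the mean squared difference between the first and second sequences. All names, dimensions, and the learning rate are illustrative, not taken from the patent.

```python
import random

def train_autoencoder(sequences, hidden_dim=2, lr=0.01, epochs=200):
    """Train a minimal linear autoencoder by gradient descent on the
    reconstruction (objective-function) loss described above."""
    n = len(sequences[0])
    rnd = random.Random(0)
    # W1 stands in for FC layer 1 + encoder, W2 for decoder + FC layer 2
    W1 = [[rnd.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(hidden_dim)]
    W2 = [[rnd.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(n)]

    def forward(x):
        # "first hidden-layer state" and reconstructed sequence
        h = [sum(W1[i][j] * x[j] for j in range(n)) for i in range(hidden_dim)]
        out = [sum(W2[j][i] * h[i] for i in range(hidden_dim)) for j in range(n)]
        return h, out

    def mse(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in sequences:
            h, out = forward(x)
            total += mse(out, x)
            # gradients of the MSE loss with respect to both maps
            grad_out = [2 * (out[j] - x[j]) / n for j in range(n)]
            grad_h = [sum(W2[j][i] * grad_out[j] for j in range(n))
                      for i in range(hidden_dim)]
            for j in range(n):
                for i in range(hidden_dim):
                    W2[j][i] -= lr * grad_out[j] * h[i]
            for i in range(hidden_dim):
                for j in range(n):
                    W1[i][j] -= lr * grad_h[i] * x[j]
        losses.append(total / len(sequences))
    return W1, W2, losses
```

Training on two toy sequences should drive the reconstruction loss down, mirroring how the patent's network parameters are adjusted from the loss value.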
Optionally, verifying the intermediate self-coding model with the verification set comprises: inputting the third time sequence behavior sequence in the verification set into the intermediate self-coding model, and acquiring a fourth time sequence behavior sequence output by a second full-connection layer in the intermediate self-coding model; determining the similarity of the third time sequence behavior sequence and the fourth time sequence behavior sequence; and determining the similarity as a verification result.
Optionally, continuing to train the intermediate self-encoding model using the validation set comprises: determining a loss value of the target function by using the similarity; and continuously adjusting network parameters in the first full connection layer, the initial encoder, the initial decoder and the second full connection layer according to the loss value to obtain an intermediate self-encoding model, and determining the intermediate model as the self-encoder model until the feature identification accuracy of the intermediate self-encoding model reaches a target threshold value.
Optionally, the method further includes training the preset detection model according to the following steps:
Step 1: obtain a detection sample set and randomly select a plurality of sub-sample sets from it, wherein the detection sample set comprises a plurality of depth features;
Step 2: randomly determine a designated dimension and randomly generate a cut point within the sub-sample sets of the current node, wherein the cut point lies between the maximum value and the minimum value of the designated dimension across those sub-sample sets;
Step 3: generate a hyperplane from the cut point and divide the sub-sample-set space of the current node into 2 subspaces;
Step 4: recursively apply steps 2 and 3 in the child nodes, continually constructing new child nodes until a termination condition is met, wherein the termination condition is that a child node contains only one depth feature or reaches a defined height.
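The steps above can be sketched in pure Python as a recursive isolation-tree builder; anomalous (fraud-candidate) points are those with short average path lengths across trees. Function names, the per-tree subsampling strategy (here the whole set), and the height limit are illustrative, not taken from the patent.

```python
import random

def build_itree(points, height, max_height, rnd):
    """Recursively build one isolation tree (steps 2-4 above).
    Stops when a node holds at most one point or the height limit is hit."""
    if len(points) <= 1 or height >= max_height:
        return {"size": len(points)}
    dim = rnd.randrange(len(points[0]))        # step 2: random dimension
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    cut = rnd.uniform(lo, hi)                  # step 2: random cut point
    left = [p for p in points if p[dim] < cut]   # step 3: hyperplane split
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_itree(left, height + 1, max_height, rnd),
            "right": build_itree(right, height + 1, max_height, rnd)}

def path_length(tree, point, height=0):
    """Depth at which a point is isolated; short paths flag anomalies."""
    if "size" in tree:
        return height
    branch = tree["left"] if point[tree["dim"]] < tree["cut"] else tree["right"]
    return path_length(branch, point, height + 1)

def avg_path(points, query, n_trees=50, max_height=8, seed=0):
    rnd = random.Random(seed)
    trees = [build_itree(points, 0, max_height, rnd) for _ in range(n_trees)]
    return sum(path_length(t, query) for t in trees) / n_trees
```

A point far from the bulk of the data is isolated after very few random cuts, so its average path length is noticeably shorter than that of a normal point, which is the isolation condition used in detection.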
Optionally, inputting all the depth features into the preset detection model to detect the plurality of users includes: inputting each user's time-series behavior sequence into the preset self-coding model and outputting a third hidden-layer state through the encoder in the preset self-coding model; and inputting the third hidden-layer state into the preset detection model to identify target users with fraudulent behavior, wherein the depth features comprise the third hidden-layer state.
According to another aspect of the embodiments of the present application, there is provided a user detection apparatus, including: a construction module, configured to construct a time-series behavior sequence for each user based on behavior data of a plurality of users over a preset time period, wherein the behavior data is the behavior information each user generates on an advertisement promotion platform, and the time-series behavior sequence characterizes the temporal features of each user's behavior; an extraction module, configured to input each user's time-series behavior sequence into a preset self-coding model to extract the depth features of the sequence; and a detection module, configured to input all the depth features into a preset detection model to detect the plurality of users and determine the users in the detection result that satisfy the isolation condition as target users with fraudulent behavior.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, a communication interface, and a communication bus, where the memory stores a computer program executable on the processor, and the memory and the processor communicate with each other through the communication bus and the communication interface, and the processor implements the steps of the method when executing the computer program.
According to another aspect of embodiments of the present application, there is provided a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the steps of any of the methods described above.
The technical scheme of the application can be applied to prediction and optimization designs in marketing intelligence technology.
Compared with the related art, the technical scheme provided by the embodiment of the application has the following advantages:
the application provides a user detection method, which comprises the following steps: constructing a time sequence behavior sequence corresponding to each user based on behavior data of a plurality of users in a preset time period, wherein the behavior data is behavior information generated by each user in an advertisement promotion platform, and the time sequence behavior sequence is used for representing time sequence characteristics of behaviors of each user; inputting the time sequence behavior sequence of each user into a preset self-coding model to extract the depth characteristic of the time sequence behavior sequence of each user; and inputting all the depth features into a preset detection model to detect a plurality of users, and determining the users meeting the isolated condition in the detection result as target users with fraudulent behaviors. According to the method and the device, the detection data are constructed from multiple dimensions according to the dependency of user behaviors on time sequences, manual marking is not needed, and not only is the manpower consumption reduced, but also the detection data are more accurate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below, and it is obvious that those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of an alternative user detection method provided in an embodiment of the present application;
FIG. 2 is a flow chart of an alternative method for training a default self-coding model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative user detection apparatus provided in accordance with an embodiment of the present application;
fig. 4 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for convenience of description and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
With the development of the internet and mobile applications, the mobile advertising market keeps expanding. A large number of advertisers invest substantial advertising budgets on this market to publicize and promote their own brands and products; meanwhile, many lawbreakers have begun to invade the mobile advertising market to commit advertisement fraud and maliciously consume advertisers' funds for profit. To maintain the healthy growth of the mobile advertising market and protect advertisers' interests, many scholars and businesses have begun to focus on research into advertising anti-fraud.
Advertisement anti-fraud usually analyzes whether traffic is abnormal from users' click and exposure behavior. Because advertising fraud takes many and constantly changing forms, labels indicating whether traffic is abnormal are difficult for experts to annotate in real fraud scenarios, so many scholars adopt unsupervised learning to find cheating traffic. Unsupervised advertisement-fraud detection needs no label information about whether traffic is abnormal; instead it mines hidden patterns by analyzing feature variables. For example, by analyzing features of users, media, and click behavior, a K-means clustering algorithm can divide unlabeled advertisement traffic data into clusters of normal and abnormal traffic so that the abnormal clusters are found; however, dividing normal from abnormal traffic by user behavior alone makes the clustering too one-sided.
In order to solve the problems mentioned in the background, according to an aspect of an embodiment of the present application, there is provided a user detection method, as shown in fig. 1, including:
step 11, constructing a time sequence behavior sequence corresponding to each user based on behavior data of a plurality of users in a preset time period, wherein the behavior data is behavior information generated by each user in an advertisement promotion platform, and the time sequence behavior sequence is used for representing time sequence characteristics of behaviors of each user.
Specifically, behavior information generated by a plurality of users in an advertisement promotion platform in a preset time period is obtained, and the whole behavior information is integrated and split to obtain a time sequence behavior sequence of each user.
And step 12, inputting the time sequence behavior sequence of each user into a preset self-coding model so as to extract the depth feature of the time sequence behavior sequence of each user.
Specifically, the model is trained in advance to obtain the preset self-coding model, so that its output after depth-feature extraction remains highly consistent with the input data. The preset self-coding model comprises a trained encoder and a trained decoder; the decoder outputs results used to verify the accuracy of the extracted depth features. When this scheme is implemented, only the time-series behavior sequence needs to be input into the trained encoder to obtain the hidden-layer state, which is exactly the depth feature to be extracted.
Optionally, the depth feature is a deeper and more abstract feature obtained after the original feature is subjected to modeling processing, in short, the original feature is often understandable by a user but not understandable by a model, and the depth feature is relatively more suitable for model processing.
And step 13, inputting all the depth features into a preset detection model to detect the plurality of users, and determining the users in the detection result that satisfy the isolation condition as target users with fraudulent behavior.
Specifically, the preset detection model is a detection model trained in advance based on an isolation forest and is used to detect abnormal users among a plurality of user sample data.
Optionally, the depth features of the multiple users are input into the preset detection model, which isolates abnormal users whose depth features differ greatly from those of other users, thereby detecting abnormal users with fraudulent behavior.
In this scheme, a time-series behavior sequence is constructed by combining user behaviors with their time order, depth features are extracted, and abnormal users are finally selected by the isolation forest. That is, detection data are constructed from multiple dimensions directly according to the temporal dependency of user behavior, without manual labeling, which reduces labor consumption and makes detection more accurate.
As an alternative embodiment, constructing a time-series behavior sequence corresponding to each user based on behavior data of a plurality of users in a preset time period includes: acquiring behavior data of a plurality of users according to a preset sampling interval in a preset time period; establishing a time relation mapping table of each user and respective behavior data according to the sequence of the occurrence time by taking the user identification of each user as an index; and determining a behavior field and matching the behavior data of each user according to the behavior field based on the time relation mapping table to obtain a time sequence behavior sequence of a plurality of users.
Specifically, the preset time period can be selected according to the detection scene and the detection requirements; a plurality of sampling points are generated within the preset time period at the preset sampling interval, and behavior data are collected at each sampling point. For example, with a preset time period of 13:00 to 21:00 (8 hours) and a preset sampling interval of 1 hour, 9 sampling points are generated when both endpoints are kept (the endpoints may be kept or discarded), i.e., behavior data are collected 9 times, and the data at each collection point represent the behavior occurring in that interval.
Specifically, the behavior data of each user carries a user identification identifier of the user, after the behavior data is collected, the behavior data collected by each user at each preset sampling interval is searched and integrated by taking the user identification identifier as an index, and a time relation mapping table of each user and the respective behavior data is generated according to the time sequence of behavior occurrence.
Specifically, after the time relation mapping table is generated, the behavior field is determined and the time relation mapping table is matched according to the behavior field, that is, the data acquired at each preset sampling interval is matched according to the behavior field to generate the time sequence behavior sequence of each user, so that the time sequence behavior sequences of a plurality of users are obtained.
By way of example, the behavioral fields may be exposure, click-through, associated IP (Internet Protocol), associated campaign, associated territory, associated media, associated device, associated advertiser, associated product industry, associated operating system, associated advertisement type, and associated operating system version, and the like.
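A hedged sketch of this construction in Python: raw events keyed by a hypothetical `user_id` and sampling `slot` are grouped into a time-relation mapping table and then matched against the behavior fields in chronological order. The field names (`exposure`, `click`) are examples only; slots with no event become `None` placeholders, i.e., missing values.

```python
from collections import defaultdict

def build_sequences(events, fields, num_samples):
    """Group raw behavior events by user id, order them by sampling slot,
    and emit one fixed-length time-series behavior sequence per user."""
    # time-relation mapping table: user id -> slot -> event
    table = defaultdict(dict)
    for ev in events:
        table[ev["user_id"]][ev["slot"]] = ev
    sequences = {}
    for uid, by_slot in table.items():
        seq = []
        for slot in range(num_samples):          # chronological order
            ev = by_slot.get(slot)
            # match each sampled event against the behavior fields;
            # absent slots or fields become None (a missing value)
            seq.append([ev.get(f) if ev else None for f in fields])
        sequences[uid] = seq
    return sequences
```

A user with events in slots 0 and 2 of a 3-slot window would get a sequence whose middle step is all `None`, ready for the missing-value handling described below.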
As an alternative embodiment, determining the behavior field and matching the behavior data of each user according to the behavior field based on the time relationship mapping table includes processing the missing value as follows: selecting a preset missing value for filling under the condition that the number of the missing values in the time sequence behavior sequence of each user does not exceed the missing value threshold; and directly deleting the time sequence behavior sequence of the user to which the missing value belongs under the condition that the number of the missing values in the time sequence behavior sequence of each user exceeds the missing value threshold value.
Specifically, the missing-value processing of the behavior data comprises the following steps:
Step 1: determine the number of missing values for the user and the behavior fields to which they belong;
Step 2: fill with the preset missing value when the user's number of missing values does not exceed the missing-value threshold, and directly delete the user's time-series behavior sequence when it does.
The missing-value threshold may be a fixed number or a percentage, and the preset missing value may be an unknown token or the value that appears most often for that behavior field among other users. For example, if a time-series behavior sequence has 100 entries, the missing-value threshold may be 10 (i.e., 1/10). When more than 10 values are missing, the user's time-series behavior sequence is deleted directly, because the data may be considered too incomplete, and deletion reduces interference with subsequent operations; when no more than 10 values are missing, each missing value is filled with the most frequent value in its behavior field, or directly with 'unk'.
The scheme processes the missing value, can reduce the interference of incomplete data on data detection, and improves the accuracy of the detection result to a certain extent.
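The rule above can be sketched as follows, assuming sequences with `None` placeholders for missing entries. The `'unk'` fill value follows the example in the text, and the fixed-count threshold is one of the two options mentioned (a percentage would work the same way).

```python
def handle_missing(sequences, threshold, fill_value="unk"):
    """Fill missing values when their count is within the threshold,
    otherwise drop the user's whole time-series behavior sequence."""
    cleaned = {}
    for uid, seq in sequences.items():
        flat = [v for step in seq for v in step]
        missing = sum(1 for v in flat if v is None)
        if missing > threshold:
            continue  # too incomplete: delete the user's sequence
        cleaned[uid] = [[fill_value if v is None else v for v in step]
                        for step in seq]
    return cleaned
```

Filling with the per-field mode instead of `'unk'` would only change `fill_value` to a value computed from the other users' data for that field.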
As an alternative embodiment, as shown in fig. 2, the method further includes training the preset self-coding model as follows:
step 21, obtaining a sample set, where the sample set includes a training set and a verification set, and the training set and the verification set both include a plurality of time-series behavior sequence samples.
And step 22, inputting the training set into an initial self-coding model for training to obtain an intermediate self-coding model, wherein the initial self-coding model comprises an initial coding module and an initial decoding module.
Specifically, the training set is passed through the initial coding module to obtain an encoding result whose dimension is smaller than that of the training set, and the encoding result is passed through the initial decoding module to obtain a decoding result.
Step 23, verifying the intermediate self-coding model by using the verification set, and determining the intermediate self-coding model as the self-encoder model when the verification result of the intermediate self-coding model on the verification set indicates that the feature identification accuracy of the intermediate self-coding model reaches a target threshold value.
Specifically, the initial self-coding model is trained to obtain the intermediate self-coding model, and the verification set is then input into the intermediate self-coding model for verification. The verification result is obtained by comparing the final output of the intermediate self-coding model with the verification set in terms of similarity; the higher the similarity, the higher the feature identification accuracy of the intermediate self-coding model. When the similarity between the final output of the intermediate self-coding model and the verification set reaches the target threshold value, the intermediate self-coding model is determined as the self-encoder model.
Alternatively, since the training of the self-encoding model is intended to maintain a high degree of similarity between the input data and the output data, the target threshold may be any number close to 100% (e.g., 99%).
Step 24, when the verification result of the intermediate self-coding model on the verification set indicates that the feature identification accuracy of the intermediate self-coding model does not reach the target threshold value, continuing to train the intermediate self-coding model by using the verification set until the feature identification accuracy of the intermediate self-coding model reaches the target threshold value, and determining the intermediate model as the self-encoder model.
Optionally, when the feature recognition accuracy of the intermediate self-coding model does not reach the target threshold, it indicates that the recognition accuracy of the intermediate self-coding model does not yet meet the requirement of the scheme on the recognition accuracy of the self-coding model, and at this time, the verification set is continuously used to train the intermediate self-coding model until the recognition accuracy of the intermediate self-coding model reaches the target threshold.
As an alternative embodiment, inputting the training set into the initial self-encoder model for training, and obtaining the intermediate self-encoding model, includes the following steps:
Step 1, inputting the first time-series behavior sequence in the training set into a first fully-connected layer to obtain a first output vector.
Optionally, the first fully-connected layer converts the first time-series behavior sequence into a vectorized representation, mapping it into a higher-dimensional space.
Step 2, inputting the first output vector into an initial encoder for encoding to obtain a first hidden layer state, wherein the initial encoding module includes the first fully-connected layer and the initial encoder.
Optionally, the first output vector is input into the initial encoder and encoded by a GRU (Gated Recurrent Unit) layer to obtain the first hidden layer state, which can be understood as a compressed representation of the first time-series behavior sequence.
Step 3, inputting the first hidden layer state into an initial decoder for decoding to obtain a second hidden layer state.
Step 4, inputting the second hidden layer state into a second fully-connected layer to map the second hidden layer state into a second time-series behavior sequence, wherein the initial decoding module includes the initial decoder and the second fully-connected layer.
Optionally, in steps 3 and 4, the first hidden layer state is decoded in reverse to generate a second time-series behavior sequence that is highly consistent with the first time-series behavior sequence; the decoder in step 3 may likewise use a GRU layer. A self-coding model built on GRU layers better captures the dependencies between behaviors and time in the time-series behavior sequence.
Step 5, determining a loss value of the objective function by using a difference value between the first time-series behavior sequence and the second time-series behavior sequence.
Specifically, the difference value between the first time-series behavior sequence and the second time-series behavior sequence is computed, and the loss value of the objective function is determined based on the target threshold and the difference value, wherein an optional formula for the loss value is: loss value = difference value - (100% - target threshold). For example, if the target threshold is 99% and the difference value is 2%, the loss value is 1%.
It should be noted that the loss value may be negative, which indicates that the recognition accuracy of the self-coding model has already exceeded the target threshold.
Step 6, adjusting network parameters in the first fully-connected layer, the initial encoder, the initial decoder and the second fully-connected layer according to the loss value to obtain the intermediate self-coding model.
Optionally, the larger the difference value, the larger the adjustment required to the network parameters of the first fully-connected layer, the initial encoder, the initial decoder and the second fully-connected layer, and the number of training iterations may be increased accordingly.
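Steps 1-6 above can be sketched as a small GRU-based autoencoder in PyTorch. This is an illustrative reconstruction, not the patent's exact implementation: the layer sizes, the MSE loss (standing in for the difference-value loss), and repeating the encoder state across time steps for decoding are all assumptions.

```python
import torch
import torch.nn as nn

class GRUAutoencoder(nn.Module):
    """First fully-connected layer -> GRU encoder -> GRU decoder ->
    second fully-connected layer, mirroring steps 1-4 above."""
    def __init__(self, n_fields=12, embed_dim=64, hidden_dim=32):
        super().__init__()
        self.fc_in = nn.Linear(n_fields, embed_dim)    # first fully-connected layer
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, embed_dim, batch_first=True)
        self.fc_out = nn.Linear(embed_dim, n_fields)   # second fully-connected layer

    def forward(self, x):
        v = self.fc_in(x)                          # step 1: first output vector
        _, h = self.encoder(v)                     # step 2: first hidden layer state
        h_rep = h.transpose(0, 1).repeat(1, x.size(1), 1)
        d, _ = self.decoder(h_rep)                 # step 3: second hidden layer state
        return self.fc_out(d)                      # step 4: second behavior sequence

model = GRUAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(8, 100, 12)                    # 8 users, 100 steps, 12 fields
recon = model(batch)
loss = nn.functional.mse_loss(recon, batch)        # step 5: loss from the difference
loss.backward()                                    # step 6: adjust network parameters
opt.step()
```

The encoder hidden state `h` is the compressed representation that is later used as the depth feature of the time-series behavior sequence.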
As an alternative embodiment, the verification of the intermediate self-coding model by using the verification set comprises the following steps:
Step 1, inputting the third time-series behavior sequence in the verification set into the intermediate self-coding model, and acquiring the fourth time-series behavior sequence output by the second fully-connected layer in the intermediate self-coding model.
Specifically, the intermediate self-coding model includes an intermediate encoding module and an intermediate decoding module, wherein the intermediate encoding module includes the first fully-connected layer and an intermediate encoder, and the intermediate decoding module includes an intermediate decoder and the second fully-connected layer. The third time-series behavior sequence of the verification set is input into the intermediate self-coding model and passes sequentially through the first fully-connected layer, the intermediate encoder, the intermediate decoder and the second fully-connected layer, which outputs the fourth time-series behavior sequence.
Step 2, determining the similarity between the third time-series behavior sequence and the fourth time-series behavior sequence.
Step 3, determining the similarity as the verification result.
As an alternative embodiment, the training of the intermediate self-encoding model using the validation set comprises: determining a loss value of the target function by using the similarity; and continuously adjusting network parameters in the first full connection layer, the initial encoder, the initial decoder and the second full connection layer according to the loss value to obtain an intermediate self-encoding model, and determining the intermediate model as the self-encoder model until the feature identification accuracy of the intermediate self-encoding model reaches a target threshold value.
Optionally, the loss value of the objective function is determined based on the similarity and the target threshold, wherein an optional formula for the loss value is: loss value = target threshold - similarity.
Specifically, while the loss value is greater than 0, the network parameters in the first fully-connected layer, the initial encoder, the initial decoder and the second fully-connected layer are continuously adjusted to obtain the intermediate self-coding model; when the loss value becomes less than or equal to 0, that is, when the feature identification accuracy of the intermediate self-coding model reaches the target threshold value, the intermediate model is determined as the self-encoder model.
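The patent does not fix a particular similarity measure for comparing the third and fourth time-series behavior sequences. Assuming cosine similarity over flattened sequences (an illustrative choice), the validation-time loss and stopping rule can be sketched as:

```python
import math

def cosine_similarity(a, b):
    """Similarity between a flattened third and fourth time-series
    behavior sequence; cosine similarity is an assumed choice."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def validation_loss(similarity, target_threshold=0.99):
    """loss value = target threshold - similarity; training continues
    while the loss is positive."""
    return target_threshold - similarity

def accuracy_reached(similarity, target_threshold=0.99):
    """True once the loss is <= 0, i.e. the feature identification
    accuracy has reached the target threshold."""
    return validation_loss(similarity, target_threshold) <= 0.0
```

With a target threshold of 99%, a reconstruction similarity of 99.5% yields a non-positive loss and stops training, while 97% keeps training going.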
In the present application, the depth features of each user's time-series behavior sequence are extracted by the GRU-based self-coding model. Keeping the input sequence and the output sequence of the self-coding model highly consistent yields depth features suitable for the detection model, which is more efficient than manually labeling anomalies.
As an optional embodiment, the method further includes training the preset detection model according to the following steps:
Step 1: obtaining a detection sample set, and randomly selecting a plurality of sub-sample sets from the detection sample set, wherein the detection sample set includes a plurality of depth features;
Step 2: randomly determining a designated dimension, and randomly generating a cut point within the sub-sample sets of the current node, wherein the value of the cut point in the designated dimension lies between the maximum value and the minimum value of the designated dimension over the sub-sample sets of the current node;
Step 3: generating a hyperplane from the cut point, and dividing the sub-sample-set space of the current node into 2 subspaces;
Step 4: repeating steps 2 and 3 recursively in the child nodes, continuously constructing new child nodes until a termination condition is met, wherein the termination condition is that a child node contains only one depth feature or the child node reaches a defined height.
Optionally, an Isolation Forest-based algorithm is used to train the preset detection model, wherein the preset detection model is composed of N isolation binary trees. An Isolation Forest is used for mining abnormal data (outlier mining): within a large body of data, it finds the data that does not conform to the pattern of the rest.
Specifically, the designated dimension may be any one of exposure volume, click-through volume, associated IP (Internet Protocol) address, associated campaign, associated territory, associated media, associated device, associated advertiser, associated product industry, associated operating system, associated advertisement type, and associated operating system version.
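Steps 1-4 above describe building one isolation binary tree. A minimal recursive sketch follows; the tuple-of-numbers point layout, dictionary node representation, and demo data are illustrative, not from the patent.

```python
import random

def build_itree(points, depth=0, max_depth=10):
    """Build one isolation binary tree: each node picks a random
    designated dimension and a random cut point between that dimension's
    min and max (step 2), splits the points into two subspaces (step 3),
    and recurses (step 4) until a node holds a single point or the tree
    reaches the defined height."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}              # external (leaf) node
    dim = random.randrange(len(points[0]))        # step 2: designated dimension
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}              # nothing left to cut on
    cut = random.uniform(lo, hi)                  # random cut point
    left = [p for p in points if p[dim] < cut]    # step 3: two subspaces
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_itree(left, depth + 1, max_depth),
            "right": build_itree(right, depth + 1, max_depth)}

def path_length(tree, point, depth=0):
    """Anomalous points are isolated near the root, so a short path
    length indicates a likely fraudulent user."""
    if "size" in tree:
        return depth
    child = "left" if point[tree["dim"]] < tree["cut"] else "right"
    return path_length(tree[child], point, depth + 1)

random.seed(7)
points = [(float(i), float(i % 5)) for i in range(16)]  # toy depth features
tree = build_itree(points)
depth0 = path_length(tree, points[0])
```

A forest of N such trees averages the path lengths; users with the shortest average paths are the isolated, suspected-fraud users.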
As an alternative embodiment, inputting all depth features into the preset detection model to detect the multiple users includes: inputting the time-series behavior sequence of each user into the preset self-coding model, and outputting a third hidden layer state through the encoder in the preset self-coding model; and inputting the third hidden layer state into the preset detection model to identify target users with fraudulent behavior, wherein the depth features include the third hidden layer state.
Specifically, the third hidden layer states corresponding to the multiple users are input into the preset detection model, and the target users with fraudulent behavior are determined according to the isolation result output by the preset detection model, wherein a fraudulent user meets a preset isolated condition, the isolated condition being that a newly constructed child node contains only the depth feature of one user or the current child node reaches the defined height.
Optionally, the target user satisfies the following conditions: the target user is sufficiently different from other, non-target users; and target users account for only a small proportion (e.g. 7%) of the users whose third hidden layer states are input. Because a target user differs markedly from the other users, it can be isolated by a small number of cuts; conversely, if isolating a sample requires so many cuts that the child node essentially reaches the defined height, the remaining samples can be judged to be normal users.
Illustratively, the third hidden layer state is input into the preset detection model to identify target users with fraudulent behavior. For example, suppose the designated dimension is click volume, whose maximum and minimum values are 500 and 20 respectively; the click volume of the randomly generated cut point then falls between 20 and 500 (exclusive). If the cut point's click volume is 200, the hyperplane generated by the cut point divides the current child node into two parts whose click-volume ranges are (20, 200) and (200, 500). New cut points are then selected to divide new child nodes, until a child node containing the click volume of only one user appears; at that point the division stops, and that user is determined as a target user with fraudulent behavior.
The Isolation Forest-based preset detection model can quickly and accurately detect target users with fraudulent behavior, whose behavior data differs markedly from that of normal users.
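For illustration only, the detection stage can be mimicked with scikit-learn's `IsolationForest` standing in for the preset detection model (the patent builds its own trees), and random vectors standing in for the users' third hidden layer states:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical depth features: one 32-dimensional hidden state per user.
rng = np.random.default_rng(0)
hidden_states = rng.normal(0.0, 1.0, size=(100, 32))  # 100 ordinary users
hidden_states[0] = 10.0                               # one user far from the rest

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = forest.fit_predict(hidden_states)            # -1 = isolated (suspected fraud)
fraud_users = np.where(labels == -1)[0]               # indices of target users
```

The `contamination` parameter plays the role of the expected small proportion of target users; the clearly deviant user is the one isolated by the forest.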
According to another aspect of the embodiments of the present application, as shown in fig. 3, the present application provides a user detection apparatus, including:
the constructing module 31 is configured to construct a time sequence behavior sequence corresponding to each user based on behavior data of a plurality of users in a preset time period, where the behavior data is behavior information generated by each user in an advertisement promotion platform, and the time sequence behavior sequence is used to represent time sequence characteristics of behaviors of each user;
the extraction module 32 is configured to input the time sequence behavior sequence of each user into a preset self-coding model, so as to extract a depth feature of the time sequence behavior sequence of each user;
and the detection module 33 is configured to input all depth features into a preset detection model to detect multiple users, and determine a user meeting an isolated condition in a detection result as a target user with a fraud behavior.
It should be noted that the building module 31 in this embodiment may be configured to execute step 11 in this embodiment, the extracting module 32 in this embodiment may be configured to execute step 12 in this embodiment, and the detecting module 33 in this embodiment may be configured to execute step 13 in this embodiment.
Optionally, the building module 31 is further configured to collect behavior data of a plurality of users according to a preset sampling interval within a preset time period; establishing a time relation mapping table of each user and respective behavior data according to the sequence of the occurrence time by taking the user identification of each user as an index; and determining a behavior field and matching the behavior data of each user according to the behavior field based on the time relation mapping table to obtain a time sequence behavior sequence of a plurality of users.
Optionally, the building module 31 is further configured to process the missing value as follows: selecting a preset missing value for filling under the condition that the number of the missing values in the time sequence behavior sequence of each user does not exceed the missing value threshold; and directly deleting the time sequence behavior sequence of the user to which the missing value belongs under the condition that the number of the missing values in the time sequence behavior sequence of each user exceeds the missing value threshold value.
Optionally, the user detection apparatus further includes a preset self-coding model training module, configured to obtain a sample set, where the sample set includes a training set and a verification set, and the training set and the verification set both include a plurality of time-series behavior sequence samples; inputting a training set into an initial self-coding model for training to obtain an intermediate self-coding model, wherein the initial self-coding model comprises an initial coding module and an initial decoding module; verifying the intermediate self-coding model by using a verification set, and determining the intermediate self-coding model as a self-encoder model when the verification result of the intermediate self-coding model on the verification set indicates that the feature identification accuracy of the intermediate self-coding model reaches a target threshold value; and when the verification result of the intermediate self-coding model to the verification set indicates that the feature recognition accuracy of the intermediate self-coding model does not reach the target threshold value, continuing to use the verification set to train the intermediate self-coding model until the feature recognition accuracy of the intermediate self-coding model reaches the target threshold value, and determining the intermediate model as a self-encoder model.
Optionally, the preset self-coding model training module is further configured to input the first sequence of timing behaviors in the training set into the first full connection layer to obtain a first output vector; inputting the first output vector into an initial encoder for encoding to obtain a first hidden layer state, wherein an initial encoding module comprises a first full-link layer and the initial encoder; inputting the first hidden layer state into an initial decoder for decoding to obtain a second hidden layer state; inputting a second hidden layer state into a second fully-connected layer to map the second hidden layer state into a second sequence of temporal behaviors, wherein the initial decoding module comprises an initial decoder and the second fully-connected layer; determining a loss value of the target function by using a difference value between the first time sequence behavior sequence and the second time sequence behavior sequence; and adjusting network parameters in the first full-link layer, the initial encoder, the initial decoder and the second full-link layer according to the loss value to obtain an intermediate self-coding model.
Optionally, the preset self-coding model training module is further configured to input the third time sequence behavior sequence in the verification set into the intermediate self-coding model, and obtain a fourth time sequence behavior sequence output by a second full-connection layer in the intermediate self-coding model; determining the similarity of the third time sequence behavior sequence and the fourth time sequence behavior sequence; and determining the similarity as a verification result.
Optionally, the preset self-coding model training module is further configured to determine a loss value of the objective function by using the similarity; and continuously adjusting network parameters in the first full connection layer, the initial encoder, the initial decoder and the second full connection layer according to the loss value to obtain an intermediate self-encoding model, and determining the intermediate model as the self-encoder model until the feature identification accuracy of the intermediate self-encoding model reaches a target threshold value.
Optionally, the user detection apparatus further includes a preset detection model training module, configured to train the preset detection model according to the following steps:
Step 1: obtaining a detection sample set, and randomly selecting a plurality of sub-sample sets from the detection sample set, wherein the detection sample set includes a plurality of depth features;
Step 2: randomly determining a designated dimension, and randomly generating a cut point within the sub-sample sets of the current node, wherein the value of the cut point in the designated dimension lies between the maximum value and the minimum value of the designated dimension over the sub-sample sets of the current node;
Step 3: generating a hyperplane from the cut point, and dividing the sub-sample-set space of the current node into 2 subspaces;
Step 4: repeating steps 2 and 3 recursively in the child nodes, continuously constructing new child nodes until a termination condition is met, wherein the termination condition is that a child node contains only one depth feature or the child node reaches a defined height.
Optionally, the preset detection model training module is further configured to input the time sequence behavior sequence of each user into a preset self-encoding model, and output a third hidden layer state through an encoder therein; and inputting the third hidden layer state into a preset detection model to identify a target user with fraudulent behaviors, wherein the depth feature comprises the third hidden layer state.
It should be noted that the above modules implement the same examples and application scenarios as the corresponding method steps, but are not limited to the disclosure of the above embodiments.
According to another aspect of the embodiments of the present application, as shown in fig. 4, the present application provides an electronic device, which includes a memory 41, a processor 42, a communication interface 43 and a communication bus 44, wherein a computer program operable on the processor 42 is stored in the memory 41, the memory 41 and the processor 42 communicate through the communication bus 44 and the communication interface 43, and the steps of the method are implemented when the processor 42 executes the computer program.
The memory and the processor in the electronic device communicate with each other through the communication bus and the communication interface. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
According to another aspect of embodiments of the present application, there is also provided a computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform any of the methods described above.
Optionally, in an embodiment of the present application, a computer readable medium is configured to store program code for the processor to perform the above method steps.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
When the embodiments of the present application are specifically implemented, reference may be made to the above embodiments, and corresponding technical effects are achieved.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or make a contribution to the prior art, or may be implemented in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for user detection, comprising:
constructing a time sequence behavior sequence corresponding to each user based on behavior data of a plurality of users in a preset time period, wherein the behavior data is behavior information generated by each user in an advertisement promotion platform, and the time sequence behavior sequence is used for representing time sequence characteristics of behaviors of each user;
inputting the time sequence behavior sequence of each user into a preset self-coding model to extract the depth feature of the time sequence behavior sequence of each user;
and inputting all the depth features into a preset detection model to detect the plurality of users, and determining the users meeting the isolated condition in the detection result as target users with fraudulent behaviors.
2. The method of claim 1, wherein the constructing the time-series behavior sequence corresponding to each user based on the behavior data of the plurality of users in the preset time period comprises:
collecting the behavior data of the plurality of users according to a preset sampling interval in the preset time period;
establishing a time relation mapping table of each user and the respective behavior data according to the sequence of the occurrence time by taking the user identification of each user as an index;
determining a behavior field and matching the behavior data of each user according to the behavior field based on the time relation mapping table to obtain the time sequence behavior sequence of the multiple users.
3. The method of claim 2, wherein determining a behavior field and matching the behavior data for each user according to the behavior field based on the temporal relationship mapping table comprises processing missing values as follows:
selecting a preset missing value for filling under the condition that the number of the missing values in the time sequence behavior sequence of each user does not exceed a missing value threshold;
and directly deleting the time sequence behavior sequence of the user to which the missing value belongs when the number of the missing values in the time sequence behavior sequence of each user exceeds a missing value threshold value.
4. The method according to any of claims 1-3, further comprising training the pre-set self-coding model as follows:
acquiring a sample set, wherein the sample set comprises a training set and a verification set, and the training set and the verification set respectively comprise a plurality of time-series behavior sequence samples;
inputting the training set into an initial self-coding model for training to obtain an intermediate self-coding model, wherein the initial self-coding model comprises an initial coding module and an initial decoding module;
verifying the intermediate self-coding model by using the verification set, and determining the intermediate self-coding model as the self-encoder model when the verification result of the intermediate self-coding model on the verification set indicates that the feature identification accuracy of the intermediate self-coding model reaches a target threshold value;
when the verification result of the intermediate self-coding model on the verification set indicates that the feature recognition accuracy of the intermediate self-coding model does not reach the target threshold value, continuing to train the intermediate self-coding model by using the verification set until the feature recognition accuracy of the intermediate self-coding model reaches the target threshold value, and determining the intermediate model as the self-coder model.
5. The method of claim 4, wherein inputting the training set into the initial self-coding model for training to obtain the intermediate self-coding model comprises:
inputting a first time-series behavior sequence in the training set into a first fully-connected layer to obtain a first output vector;
inputting the first output vector into an initial encoder for encoding to obtain a first hidden-layer state, wherein the initial encoding module comprises the first fully-connected layer and the initial encoder;
inputting the first hidden-layer state into an initial decoder for decoding to obtain a second hidden-layer state;
inputting the second hidden-layer state into a second fully-connected layer to map it into a second time-series behavior sequence, wherein the initial decoding module comprises the initial decoder and the second fully-connected layer;
determining a loss value of an objective function from the difference between the first time-series behavior sequence and the second time-series behavior sequence;
and adjusting network parameters of the first fully-connected layer, the initial encoder, the initial decoder and the second fully-connected layer according to the loss value to obtain the intermediate self-coding model.
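The encode-decode-loss loop of claim 5 can be sketched as a purely linear autoencoder in NumPy. The layer sizes, learning rate, and single training vector are illustrative assumptions, and the manual gradient step stands in for a deep-learning framework's optimizer; the claim's encoder/decoder would normally be recurrent rather than a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes mirror the claim: input sequence (8) -> first fully-connected
# layer (6) -> encoder hidden state (4) -> decoder (6) -> second
# fully-connected layer mapping back to a reconstructed sequence (8).
sizes = [8, 6, 4, 6, 8]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    acts = [x]
    for w in weights:
        acts.append(w @ acts[-1])
    return acts  # acts[2] is the hidden-layer state produced by the encoder

def train_step(x, lr=0.05):
    acts = forward(x)
    x_hat = acts[-1]                   # the second (reconstructed) sequence
    loss = float(np.mean((x_hat - x) ** 2))
    grad = 2.0 * (x_hat - x) / x.size  # gradient of the MSE objective
    for i in reversed(range(len(weights))):
        g_w = np.outer(grad, acts[i])  # gradient w.r.t. this layer's weights
        grad = weights[i].T @ grad     # backpropagate to the previous layer
        weights[i] -= lr * g_w         # adjust network parameters
    return loss

x = rng.normal(size=8)
losses = [train_step(x) for _ in range(300)]
# reconstruction loss shrinks as the four weight matrices are adjusted
```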
6. The method of claim 5, wherein verifying the intermediate self-coding model by using the verification set comprises:
inputting a third time-series behavior sequence in the verification set into the intermediate self-coding model, and acquiring a fourth time-series behavior sequence output by the second fully-connected layer of the intermediate self-coding model;
determining a similarity between the third time-series behavior sequence and the fourth time-series behavior sequence;
and determining the similarity as the verification result.
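The claim does not fix a similarity metric; cosine similarity is one common choice and is used here purely as an assumption. The `third`/`fourth` vectors are toy data standing in for a validation sequence and its reconstruction:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Verification result: similarity between a validation-set input sequence
# and the sequence reconstructed by the intermediate model (hypothetical values).
third = [1.0, 2.0, 3.0, 4.0]
fourth = [1.1, 1.9, 3.2, 3.8]
result = cosine_similarity(third, fourth)
# a result close to 1.0 indicates an accurate reconstruction
```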
7. The method of claim 6, wherein continuing to train the intermediate self-coding model by using the verification set comprises:
determining a loss value of the objective function from the similarity;
and continuing to adjust the network parameters of the first fully-connected layer, the initial encoder, the initial decoder and the second fully-connected layer according to the loss value until the feature recognition accuracy of the intermediate self-coding model reaches the target threshold, and then determining the intermediate self-coding model as the self-coding model.
8. The method of any one of claims 1, 5, 6 and 7, further comprising training the preset detection model as follows:
step 1: obtaining a detection sample set, and randomly selecting a plurality of sub-sample sets from the detection sample set, wherein the detection sample set comprises a plurality of depth features;
step 2: randomly determining a designated dimension, and randomly generating a cut point within the plurality of sub-sample sets of the current node, wherein the value of the cut point lies between the maximum value and the minimum value of the designated dimension over the plurality of sub-sample sets of the current node;
step 3: generating a hyperplane from the cut point, and dividing the sample space of the current node into 2 subspaces;
step 4: recursively performing step 2 and step 3 in the child nodes, continuously constructing new child nodes until a termination condition is met, wherein the termination condition is that a child node contains only one depth feature or the child node reaches a defined height.
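Steps 2-4 describe building a single isolation tree, the building block of an isolation forest. A minimal recursive sketch, where the function names, the dict-based tree layout, and the toy 2-D data are all illustrative assumptions:

```python
import random

def build_itree(points, height, max_height, rng):
    """Recursively isolate feature vectors (steps 2-4 of the claim)."""
    # Termination: one (or zero) depth feature left, or defined height reached.
    if len(points) <= 1 or height >= max_height:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))         # step 2: random designated dimension
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}            # cannot cut a constant dimension
    cut = rng.uniform(lo, hi)                   # step 2: cut point between min and max
    left = [p for p in points if p[dim] < cut]  # step 3: hyperplane -> 2 subspaces
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,             # step 4: recurse into child nodes
            "left": build_itree(left, height + 1, max_height, rng),
            "right": build_itree(right, height + 1, max_height, rng)}

rng = random.Random(42)
data = [(rng.random(), rng.random()) for _ in range(16)]
tree = build_itree(data, height=0, max_height=8, rng=rng)
```

A forest repeats this construction over many randomly drawn sub-sample sets; anomalous points end up in leaves at unusually shallow depths.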
9. The method of any one of claims 1, 5, 6 and 7, wherein inputting all the depth features into the preset detection model to detect the plurality of users comprises:
inputting the time-series behavior sequence of each user into the preset self-coding model, and outputting a third hidden-layer state through the encoder of the preset self-coding model;
and inputting the third hidden-layer state into the preset detection model to identify the target users with fraudulent behavior, wherein the depth features include the third hidden-layer state.
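The encode-then-score pipeline of claim 9 can be sketched end to end. Everything here is a hedged toy: the fixed `W_enc` matrix stands in for the trained encoder, the synthetic users are fabricated, and the mean-distance score is a deliberately simplified stand-in for the isolation-forest scoring the claim actually relies on:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical encoder standing in for the trained self-coding model: it maps
# each length-8 behavior sequence to a 3-dimensional third hidden-layer state.
W_enc = np.eye(3, 8)

def encode(sequence):
    return W_enc @ np.asarray(sequence, float)

normal_users = rng.normal(0.0, 1.0, size=(20, 8))
fraud_user = np.full(8, 8.0)          # an obviously isolated behavior pattern
sequences = np.vstack([normal_users, fraud_user])

# Depth features: the hidden-layer state of every user's sequence.
features = np.array([encode(s) for s in sequences])

# Simplified detector: mean distance to all other depth features; the most
# isolated feature is flagged as the target user with fraudulent behavior.
dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
scores = dists.mean(axis=1)
target = int(scores.argmax())         # index of the flagged user
```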
10. A user detection device, comprising:
a construction module, configured to construct a time-series behavior sequence corresponding to each user based on behavior data of a plurality of users within a preset time period, wherein the behavior data is behavior information generated by each user in an advertisement promotion platform, and the time-series behavior sequence is used to represent the temporal characteristics of each user's behavior;
an extraction module, configured to input the time-series behavior sequence of each user into a preset self-coding model to extract depth features of the time-series behavior sequence of each user;
and a detection module, configured to input all the depth features into a preset detection model to detect the plurality of users, and to determine users meeting an isolation condition in the detection result as target users with fraudulent behavior.
11. An electronic device comprising a memory, a processor, a communication interface and a communication bus, wherein the memory stores a computer program operable on the processor, the memory and the processor communicate with each other through the communication interface via the communication bus, and the processor implements the steps of the method according to any one of claims 1 to 9 when executing the computer program.
12. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any of claims 1 to 9.
CN202210015330.2A 2022-01-07 2022-01-07 User detection method, device, equipment and medium Pending CN114399314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210015330.2A CN114399314A (en) 2022-01-07 2022-01-07 User detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210015330.2A CN114399314A (en) 2022-01-07 2022-01-07 User detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114399314A true CN114399314A (en) 2022-04-26

Family

ID=81229451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210015330.2A Pending CN114399314A (en) 2022-01-07 2022-01-07 User detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114399314A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115144780A (en) * 2022-06-16 2022-10-04 中国第一汽车股份有限公司 Battery health detection method and storage medium

Similar Documents

Publication Publication Date Title
CN109525595B (en) Black product account identification method and equipment based on time flow characteristics
CN108734184B (en) Method and device for analyzing sensitive image
CN111798312A (en) Financial transaction system abnormity identification method based on isolated forest algorithm
CN111008337B (en) Deep attention rumor identification method and device based on ternary characteristics
CN113065474B (en) Behavior recognition method and device and computer equipment
CN109903053B (en) Anti-fraud method for behavior recognition based on sensor data
CN112235327A (en) Abnormal log detection method, device, equipment and computer readable storage medium
CN105573995A (en) Interest identification method, interest identification equipment and data analysis method
CN111401447A (en) Artificial intelligence-based flow cheating identification method and device and electronic equipment
CN114399314A (en) User detection method, device, equipment and medium
CN106301979B (en) Method and system for detecting abnormal channel
CN104657466A (en) Method and device for identifying user interest based on forum post features
CN114329455B (en) User abnormal behavior detection method and device based on heterogeneous graph embedding
CN111639700A (en) Target similarity recognition method and device, computer equipment and readable storage medium
CN111148045A (en) User behavior cycle extraction method and device
CN104731937A (en) User behavior data processing method and device
CN114238764A (en) Course recommendation method, device and equipment based on recurrent neural network
CN113570391A (en) Community division method, device, equipment and storage medium based on artificial intelligence
CN113313065A (en) Video processing method and device, electronic equipment and readable storage medium
CN112559868A (en) Information recall method and device, storage medium and electronic equipment
CN112016088A (en) Method and device for generating file detection model and method and device for detecting file
CN115937556A (en) Object identification method, device, equipment and storage medium
CN115439928A (en) Operation behavior identification method and device
CN115564970A (en) Network attack tracing method, system and storage medium
CN115293872A (en) Method for establishing risk identification model and corresponding device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination