CN112307472A

CN112307472A - Abnormal user identification method and device based on intelligent decision and computer equipment

Info

Publication number: CN112307472A
Application number: CN202011211553.3A
Authority: CN
Inventors: 陶亦然
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-11-03
Filing date: 2020-11-03
Publication date: 2021-02-02
Also published as: WO2022095352A1

Abstract

The embodiment of the application belongs to the field of artificial intelligence and relates to an abnormal user identification method and device based on intelligent decision, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring an original data set; performing data recombination on the original data set to obtain a labeled sample and an unlabeled sample; inputting the labeled sample into a first user identification model to perform first training on the first user identification model to obtain a second user identification model; performing data enhancement on the unlabeled sample to obtain an enhanced unlabeled sample set corresponding to the unlabeled sample; performing second training on the second user identification model through the labeled samples and the enhanced unlabeled sample set corresponding to the unlabeled samples to obtain an abnormal user identification model; and inputting the user sample to be identified into the abnormal user identification model to obtain a user identification result. In addition, the present application also relates to blockchain techniques, where the original data set may be stored in a blockchain. The method and the device improve the accuracy of abnormal user identification.

Description

Abnormal user identification method and device based on intelligent decision and computer equipment

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to an abnormal user identification method and apparatus based on an intelligent decision, a computer device, and a storage medium.

Background

With the development of internet technology, more and more users obtain and enjoy various information services through the internet, and the platforms providing these services record and obtain a large amount of user information. Such platforms often encounter various abnormal users, such as the "wool party", who obtain large benefits by using false information and cause great losses to the platform. Abnormal users may also commit network fraud and network attacks, so the platform needs to be able to identify them.

However, conventional abnormal user identification techniques typically rely on a rule model or a blacklist. A rule model consists of empirical rules summarized from abnormal users that have already been discovered; because it relies on subjective human judgment, its coverage is poor and its identification accuracy is low. Blacklist identification acquires blacklist data from an external source and tracks and monitors the abnormal users appearing in the blacklist; it cannot deal with new abnormal users that appear at any time, so its accuracy is also low.

Disclosure of Invention

An embodiment of the application aims to provide an abnormal user identification method and device based on intelligent decision, computer equipment and a storage medium, so as to solve the problem of low accuracy of abnormal user identification.

In order to solve the above technical problem, an embodiment of the present application provides an abnormal user identification method based on an intelligent decision, which adopts the following technical scheme:

acquiring an original data set, wherein the original data set comprises blacklist data, truth-checking user data and original user data;

performing data recombination on the original data set to obtain a labeled sample and an unlabeled sample;

inputting the labeled sample into a first user identification model, and performing first training on the first user identification model through the labeled sample to obtain a second user identification model;

performing data enhancement on the unlabeled sample to obtain an enhanced unlabeled sample set corresponding to the unlabeled sample;

performing second training on the second user identification model through the labeled sample and the enhanced unlabeled sample set corresponding to the unlabeled sample to obtain an abnormal user identification model;

and inputting the user sample to be identified into the abnormal user identification model to obtain a user identification result.

Further, the step of performing data recombination on the original data set to obtain a labeled sample and an unlabeled sample includes:

comparing the blacklist data and the truth-checking user data with the original user data respectively to determine a labeled user list and an initial unlabeled sample;

performing data filling on the labeled user list according to the original data set to obtain an initial labeled sample;

and performing feature screening on the initial labeled sample and the initial unlabeled sample to obtain a labeled sample and an unlabeled sample.

Further, the step of performing feature screening on the initial labeled sample and the initial unlabeled sample to obtain a labeled sample and an unlabeled sample specifically includes:

inputting the initial labeled sample into a first user identification model, and performing third training on the first user identification model through the initial labeled sample to obtain a third user identification model;

inputting the initial unlabeled sample into the third user identification model to obtain a pseudo label of the initial unlabeled sample;

and performing feature screening on the initial labeled sample and the initial unlabeled sample with the pseudo label through a random forest to obtain a labeled sample and an unlabeled sample, and determining the screened features as target features.

Further, the step of performing feature screening on the initial labeled sample and the initial unlabeled sample with the pseudo label through a random forest to obtain a labeled sample and an unlabeled sample, and determining the screened features as target features includes:

taking the initial labeled sample and the initial unlabeled sample with the pseudo label as samples to be screened, and performing random sampling with replacement multiple times to obtain a plurality of feature screening training sets;

generating a plurality of decision trees based on the plurality of feature screening training sets to obtain a random forest;

calculating a first out-of-bag data error of each decision tree in the random forest according to out-of-bag data, wherein the out-of-bag data is from a feature screening training set corresponding to each decision tree;

randomly changing features in the out-of-bag data, and calculating a second out-of-bag data error of each decision tree;

calculating the feature contribution degree of each feature according to the calculated second out-of-bag data error and the first out-of-bag data error;

and performing feature screening on the initial labeled sample and the initial unlabeled sample with the pseudo label according to the calculated feature contribution degree to obtain a labeled sample and an unlabeled sample, and determining the screened features as target features.
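The out-of-bag comparison in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method's own formula: the contribution degree is assumed here to be the average increase in out-of-bag error after one feature is shuffled, and the names `oob_error`, `feature_contribution` and the stand-in classifier `stump` are hypothetical.

```python
import random


def oob_error(tree, oob_rows, oob_labels):
    """Fraction of out-of-bag rows that the tree misclassifies."""
    wrong = sum(1 for x, y in zip(oob_rows, oob_labels) if tree(x) != y)
    return wrong / len(oob_rows)


def feature_contribution(trees, oob_data, feature, seed=0):
    """Average increase in out-of-bag error after shuffling one feature.

    trees: list of callables mapping a feature vector to a class label.
    oob_data: list of (rows, labels), the out-of-bag data of each tree.
    Assumption: contribution = mean(second error - first error).
    """
    rng = random.Random(seed)
    increases = []
    for tree, (rows, labels) in zip(trees, oob_data):
        first = oob_error(tree, rows, labels)
        # Randomly change the feature by shuffling its column across rows.
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:feature] + [v] + row[feature + 1:]
                    for row, v in zip(rows, column)]
        second = oob_error(tree, shuffled, labels)
        increases.append(second - first)
    return sum(increases) / len(increases)


# Hypothetical stand-in for a trained decision tree: it uses only feature 0.
def stump(x):
    return 1 if x[0] > 0.5 else 0


rows = [[0.1, 7.0], [0.2, 3.0], [0.8, 5.0], [0.9, 1.0]]
labels = [0, 0, 1, 1]
oob = [(rows, labels)]
unused = feature_contribution([stump], oob, feature=1)
used = feature_contribution([stump], oob, feature=0)
```

Shuffling a feature that the tree never reads leaves the error unchanged, so its contribution is zero; a feature the tree depends on can only keep or raise the error in this example.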

Further, the step of performing data enhancement on the unlabeled samples to obtain an enhanced unlabeled sample set corresponding to the unlabeled samples includes:

for each unlabeled sample, determining a neighboring sample set of the unlabeled samples according to Euclidean distances among the unlabeled samples, wherein the neighboring sample set comprises a preset number of neighboring samples;

for each neighboring sample, selecting an extended sample point on the line connecting the neighboring sample and the unlabeled sample in the feature space;

and constructing an enhanced unlabeled sample set corresponding to the unlabeled sample according to the selected extended sample points and the unlabeled sample.
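The neighbor-based enhancement above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `enhance_unlabeled`, the parameter `k` (standing for the preset number of neighbors) and the uniform random choice of the point on the connecting line are assumptions made here.

```python
import math
import random


def enhance_unlabeled(samples, k=2, seed=0):
    """Return, for each unlabeled sample, the sample plus k interpolated points.

    samples: list of equal-length feature vectors (lists of floats).
    k: assumed "preset number" of neighboring samples.
    """
    rng = random.Random(seed)
    enhanced_sets = []
    for i, x in enumerate(samples):
        # Rank all other samples by Euclidean distance to x.
        others = [s for j, s in enumerate(samples) if j != i]
        neighbors = sorted(others, key=lambda s: math.dist(x, s))[:k]
        points = [list(x)]  # the original sample stays in its enhanced set
        for nb in neighbors:
            # Pick a point on the line segment between x and the neighbor.
            t = rng.random()
            points.append([a + t * (b - a) for a, b in zip(x, nb)])
        enhanced_sets.append(points)
    return enhanced_sets


unlabeled = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
enhanced = enhance_unlabeled(unlabeled, k=2)
```

Each entry of `enhanced` holds one original sample followed by its extended sample points, so the distant sample `[5.0, 5.0]` does not influence the points generated around `[0.0, 0.0]`.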

Further, the step of performing second training on the second user identification model through the labeled sample and the enhanced unlabeled sample set corresponding to the unlabeled sample to obtain an abnormal user identification model includes:

inputting the labeled samples and the enhanced unlabeled sample set corresponding to the unlabeled samples into the second user identification model to obtain user prediction results of the labeled samples and user prediction results of each enhanced unlabeled sample in the enhanced unlabeled sample set;

determining the user prediction result of the unlabeled sample according to the user prediction result of each enhanced unlabeled sample;

taking the user prediction result of the unlabeled sample in the previous round of the second training as the pseudo label of the unlabeled sample in the current round of the second training, so as to calculate the regularized cross entropy loss of the labeled sample and the unlabeled sample;

and adjusting parameters of the second user identification model according to the regularized cross entropy loss until the model converges to obtain an abnormal user identification model.

Further, the step of inputting the user sample to be identified into the abnormal user identification model to obtain the user identification result includes:

obtaining a user sample to be identified;

performing feature screening on the user sample to be identified according to preset target features;

and inputting the user sample to be identified after feature screening into the abnormal user identification model to obtain a user identification result.

In order to solve the above technical problem, an embodiment of the present application further provides an abnormal user identification device based on an intelligent decision, which adopts the following technical scheme:

the data set acquisition module is used for acquiring an original data set, wherein the original data set comprises blacklist data, truth-checking user data and original user data;

the data recombination module is used for performing data recombination on the original data set to obtain a labeled sample and an unlabeled sample;

the first training module is used for inputting the labeled sample into a first user identification model so as to perform first training on the first user identification model through the labeled sample to obtain a second user identification model;

the data enhancement module is used for performing data enhancement on the unlabeled sample to obtain an enhanced unlabeled sample set corresponding to the unlabeled sample;

the second training module is used for carrying out second training on the second user identification model through the labeled sample and the enhanced unlabeled sample set corresponding to the unlabeled sample to obtain an abnormal user identification model;

and the sample input module is used for inputting the user sample to be identified into the abnormal user identification model to obtain a user identification result.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the intelligent decision-based abnormal user identification method when executing the computer program.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the intelligent decision-based abnormal user identification method described above are implemented.

Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects: after an original data set is obtained, data recombination is performed through data comparison to obtain a labeled sample and an unlabeled sample; the labeled sample is input into a first user identification model for first training, yielding a second user identification model with a certain abnormal user identification capability; data enhancement is performed on the unlabeled samples to obtain an enhanced unlabeled sample set, so that a plurality of similar unlabeled samples are predicted instead of a single unlabeled sample, which improves the generalization capability of the second user identification model; the second user identification model is then trained jointly on the labeled sample and the enhanced unlabeled sample set, so that the model further extracts information from the unlabeled samples, and the abnormal user identification model is finally obtained.

Drawings

In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of an intelligent decision-based abnormal user identification method according to the present application;

FIG. 3 is a flowchart of one embodiment of step S202 in FIG. 2;

FIG. 4 is a flowchart of one embodiment of step S2023 of FIG. 3;

FIG. 5 is a flowchart of one embodiment of step S205 of FIG. 2;

FIG. 6 is a schematic block diagram illustrating one embodiment of an intelligent decision-based abnormal user identification apparatus according to the present application;

FIG. 7 is a schematic block diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, MPEG compression standard audio layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, MPEG compression standard audio layer 4), laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the abnormal user identification method based on intelligent decision provided by the embodiment of the present application is generally executed by a server, and accordingly, the abnormal user identification apparatus based on intelligent decision is generally disposed in the server.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow diagram of one embodiment of an intelligent decision-based abnormal user identification method in accordance with the present application is shown. The abnormal user identification method based on the intelligent decision comprises the following steps:

step S201, obtaining an original data set, where the original data set includes blacklist data, truth-checking user data, and original user data.

Abnormal user identification in the present application relates to intelligent decision making in artificial intelligence. In this embodiment, an electronic device (for example, the server shown in fig. 1) on which the abnormal user identification method based on intelligent decision runs may communicate with the terminal through a wired connection or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.

The blacklist data may be user data corresponding to users determined to be abnormal; the truth-checking user data (also referred to as verification user data) may be user data of users who have passed security authentication and are determined to be non-abnormal; the original user data may be the full set of user data recorded by the platform in its business or production activities.

Specifically, the server reads an original data set from the database, wherein the original data set comprises blacklist data, truth-checking user data and original user data.

In one embodiment, the blacklist data may be obtained in advance from an external source, for example provided by a third-party data provider. In its operation and production activities, the platform may perform strict identity authentication on some users, and the user data corresponding to users who complete the identity authentication is the verification user data. For example, in a wool party identification scenario, the blacklist data records the wool party users determined by a third party, including virtual mobile phone numbers that cannot pass verification methods such as man-machine verification. The verification user data may be user data of non-abnormal users determined by the platform through verification methods such as face recognition and bank card binding.

It is emphasized that the original data set may also be stored in a node of a blockchain in order to further ensure the privacy and security of the original data set.

The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

Step S202, the original data set is subjected to data reorganization to obtain a labeled sample and an unlabeled sample.

Specifically, comparing the user identifiers (such as user names or mobile phone numbers) of the blacklist data and the original user data, and adding the user corresponding to the repeated user identifier into the black sample; and comparing the user identifications of the verification user data and the original user data, and adding the user corresponding to the repeated user identification into the white sample.

The black samples and the white samples form the labeled samples, and the data in the original user data that has no match is taken as the unlabeled samples; the labeled samples and the unlabeled samples also include the user data of the users in the samples.

The label of the sample may identify whether the user is an anomalous user. For example, the labeled sample includes a user a and a user B, where the label of the user a is 1, which indicates that the user a is an abnormal user; the label of the user B is 0, which indicates that the user B is a non-abnormal user; the user C does not have a tag, and cannot know whether the user C is an abnormal user or a non-abnormal user.
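The identifier comparison in step S202 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `reorganize`, the dictionary layout of the user data, and the rule that a blacklist match takes priority when an identifier appears in both lists are all hypothetical choices made for the example.

```python
def reorganize(blacklist_ids, verified_ids, raw_users):
    """Split original user records into labeled and unlabeled samples.

    raw_users: dict mapping a user identifier to its feature dict.
    Returns (labeled, unlabeled); labeled maps id -> (features, label),
    where label 1 marks an abnormal user and 0 a non-abnormal user.
    """
    labeled, unlabeled = {}, {}
    for user_id, features in raw_users.items():
        if user_id in blacklist_ids:
            # Black sample (assumption: blacklist wins if both lists match).
            labeled[user_id] = (features, 1)
        elif user_id in verified_ids:
            labeled[user_id] = (features, 0)  # white sample
        else:
            unlabeled[user_id] = features     # no matching identifier
    return labeled, unlabeled


raw = {
    "A": {"logins": 40},
    "B": {"logins": 3},
    "C": {"logins": 7},
}
# "X" is on the blacklist but never appears in the original user data.
labeled, unlabeled = reorganize({"A", "X"}, {"B"}, raw)
```

Here user A becomes a black sample, user B a white sample, and user C stays unlabeled, matching the example of users A, B and C above.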

Step S203, inputting the labeled sample into the first user identification model, and performing first training on the first user identification model through the labeled sample to obtain a second user identification model.

The first user identification model may be a user identification model for which the first training has not yet been completed.

Specifically, a labeled sample is input into a first user identification model, user data in the labeled sample is used as model input, a sample label is used as expected output of the model, and the first user identification model is trained (namely, first training) according to the model input and the expected output to obtain a second user identification model.
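The first training can be sketched as ordinary supervised learning. The text does not specify the model type, so the logistic regression below is purely an assumption for illustration, as are the names `first_training` and `predict` and the hyperparameters `epochs` and `lr`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def first_training(samples, labels, epochs=200, lr=0.5):
    """Fit weights and bias by gradient descent on the cross-entropy loss.

    samples are the model inputs (user data); labels are the expected outputs.
    """
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            grad = p - y  # derivative of the cross-entropy w.r.t. the logit
            weights = [w - lr * grad * v for w, v in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(model, x):
    """Predicted probability that the user is abnormal."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy labeled samples: label 1 = abnormal user, label 0 = non-abnormal user.
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
second_model = first_training(X, y)
```

The returned `second_model` plays the role of the second user identification model, which is then refined with unlabeled data in the later steps.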

Step S204, performing data enhancement on the unlabeled samples to obtain an enhanced unlabeled sample set corresponding to the unlabeled samples.

Specifically, unlabeled samples are also used in model training. Because the unlabeled samples have no labels, they may introduce a large error in training. To improve the generalization capability of the model, data enhancement is performed on the unlabeled samples; that is, data similar to the unlabeled samples is generated to expand their data scale, yielding an enhanced unlabeled sample set.

In one embodiment, based on the vicinal risk minimization principle, linear interpolation is used to obtain enhanced unlabeled samples:

$$(a_{new}, b_{new}, \ldots, m_{new}) = \lambda (a_i, b_i, \ldots, m_i) + (1-\lambda)(a_j, b_j, \ldots, m_j) \tag{1}$$

where $(a_{new}, b_{new}, \ldots, m_{new})$ is an enhanced unlabeled sample generated by interpolation, $(a_i, b_i, \ldots, m_i)$ is an unlabeled sample, $(a_j, b_j, \ldots, m_j)$ is another randomly chosen unlabeled sample, and $\lambda$ takes a value in the range from 0 to 1.
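Formula (1) can be sketched in a few lines. This is an illustrative sketch only: the function names `interpolate` and `enhanced_set`, the number of generated samples `n_new`, and drawing λ uniformly at random are assumptions not stated in the text.

```python
import random


def interpolate(sample_i, sample_j, lam):
    """Formula (1): lam * sample_i + (1 - lam) * sample_j, feature by feature."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(sample_i, sample_j)]


def enhanced_set(index, unlabeled, n_new=3, seed=0):
    """Enhanced set of one unlabeled sample: the sample plus n_new interpolations."""
    rng = random.Random(seed)
    base = unlabeled[index]
    partners = [s for j, s in enumerate(unlabeled) if j != index]
    result = [list(base)]
    for _ in range(n_new):
        other = rng.choice(partners)  # another randomly chosen unlabeled sample
        result.append(interpolate(base, other, rng.random()))
    return result


data = [[1.0, 2.0], [3.0, 6.0], [5.0, 0.0]]
midpoint = interpolate(data[0], data[1], 0.5)
first_set = enhanced_set(0, data)
```

With λ = 0.5 the new sample is the midpoint of the two originals, and with λ = 1 it coincides with the first sample.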

Step S205, performing second training on the second user identification model through the labeled samples and the enhanced unlabeled sample set corresponding to the unlabeled samples to obtain an abnormal user identification model.

Specifically, the labeled samples and the enhanced unlabeled sample set corresponding to the unlabeled samples are input into the second user identification model. Each enhanced unlabeled sample in the enhanced unlabeled sample set has a user prediction result; the user prediction result with the highest occurrence probability is taken as the user prediction result of the unlabeled sample, and the user prediction result of the unlabeled sample in the previous training round is used as its pseudo label in the current training round.

The cross-entropy loss is calculated from the user prediction results and labels of the labeled samples and from the user prediction results and pseudo labels of the unlabeled samples, and the model parameters are adjusted, with reducing the cross-entropy loss as the objective, until the model converges, yielding the abnormal user identification model.
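One round of the loss computation in the second training can be sketched as follows. Several points are assumptions made for illustration: "highest occurrence probability" is read here as a majority vote over the enhanced samples, the loss is a plain binary cross-entropy without any regularization term, and the names `vote`, `cross_entropy` and `round_loss` are hypothetical.

```python
import math
from collections import Counter


def vote(predictions):
    """Most frequent predicted class among the enhanced samples of one user."""
    return Counter(predictions).most_common(1)[0][0]


def cross_entropy(prob_abnormal, target):
    """Binary cross-entropy for one sample; the probability is clipped."""
    p = min(max(prob_abnormal, 1e-12), 1.0 - 1e-12)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def round_loss(labeled_probs, labels, unlabeled_probs, pseudo_labels):
    """Average cross-entropy over labeled samples (true labels) and
    unlabeled samples (pseudo labels taken from the previous round)."""
    terms = [cross_entropy(p, y) for p, y in zip(labeled_probs, labels)]
    terms += [cross_entropy(p, y) for p, y in zip(unlabeled_probs, pseudo_labels)]
    return sum(terms) / len(terms)


# Previous round: three enhanced samples of one unlabeled user were predicted.
pseudo = vote([1, 0, 1])
# Current round: one labeled sample and the same unlabeled user.
loss = round_loss([0.5], [1], [0.5], [pseudo])
```

The value returned by `round_loss` is what the parameter update would reduce; the update step itself depends on the chosen model and is omitted.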

Step S206, inputting the user sample to be identified into the abnormal user identification model to obtain a user identification result.

Specifically, when the model is applied, the server receives a user sample to be identified, inputs the user sample to be identified into the abnormal user identification model, and obtains a user identification result, wherein the user identification result shows whether the user is an abnormal user.

In the embodiment, after an original data set is obtained, data is recombined through data comparison to obtain a labeled sample and an unlabeled sample; inputting the labeled sample into a first user identification model to perform first training, and obtaining a second user identification model with certain abnormal user identification capacity; performing data enhancement on the unlabeled samples to obtain an enhanced unlabeled sample set, and predicting a plurality of similar unlabeled samples instead of predicting one unlabeled sample originally so as to improve the generalization capability of the second user identification model; the second user identification model is comprehensively trained through the labeled sample and the enhanced unlabeled sample set, the model further extracts information from the unlabeled sample for learning, and finally the abnormal user identification model is obtained.

Further, as shown in fig. 3, the step S202 may include:

step S2021, comparing the blacklist data and the truth-checking user data with the original user data respectively to determine a labeled user list and an initial unlabeled sample.

Specifically, the user identifiers are compared to determine the users in the blacklist data and in the verification user data that also appear in the original user data, which gives the labeled user list; the user data of the users in the original user data that have no match is taken as the initial unlabeled sample.

Step S2022, performing data filling on the labeled user list according to the original data set to obtain an initial labeled sample.

Specifically, the labeled user list includes black users and white users: the black users are obtained by comparing the blacklist data with the original user data, and the white users are obtained by comparing the verification user data with the original user data. The server reads the features of each dimension of the black users from the blacklist data and the original user data and adds them to the labeled user list; it likewise reads the features of each dimension of the white users from the verification user data and the original user data and adds them to the labeled user list, which gives the initial labeled sample. Missing features may be filled in; where feature values conflict, the blacklist data or the verification user data prevails.
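The data filling rule can be sketched as a merge of two records. This is a minimal illustration: the function name `fill_user`, the use of `None` to mark a missing value, and filling from a table of default values are assumptions, since the text does not say how missing features are filled.

```python
def fill_user(raw_record, trusted_record, defaults):
    """Merge one user's features from the original data and a trusted source.

    trusted_record comes from the blacklist data or the verification user
    data and wins whenever both sources hold a value for the same feature.
    defaults supplies fill values for features missing from both sources
    (an assumed filling strategy).
    """
    merged = dict(defaults)
    merged.update({k: v for k, v in raw_record.items() if v is not None})
    merged.update({k: v for k, v in trusted_record.items() if v is not None})
    return merged


filled = fill_user(
    {"city": "A", "age": None},                  # original user data
    {"city": "B"},                               # blacklist / verification data
    {"age": 0, "city": "", "device_count": 0},   # hypothetical fill values
)
```

Because the trusted record is applied last, its value for `city` overrides the conflicting value from the original user data.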

Step S2023, performing feature screening on the initial labeled sample and the initial unlabeled sample to obtain a labeled sample and an unlabeled sample.

Specifically, the initial labeled sample and the initial unlabeled sample have more feature dimensions, and features with the same dimension can be screened from the initial labeled sample and the initial unlabeled sample to obtain a labeled sample and an unlabeled sample.

For example, in a wool party detection scenario related to card and ticket verification, the screened features may include the number of occurrences of the terminal identifier of the user terminal in a verification record within a preset time, the number of times of activations of the network address of the user terminal within the preset time, verification time, the service type, settlement price, and the like.

In this embodiment, matching users are determined through data comparison and feature screening is performed, so that the original data set is recombined into labeled samples and unlabeled samples for model training.

Further, as shown in fig. 4, the step S2023 may include:

step S20231, inputting the initial labeled sample into the first user identification model, and performing third training on the first user identification model through the initial labeled sample to obtain a third user identification model.

Specifically, the initial labeled sample and the initial unlabeled sample contain full-dimensional features, the initial labeled sample is input into the first user identification model, and the first user identification model is trained from the full features to obtain a third user identification model.

Step S20232, inputting the initial unlabeled sample into the third user identification model to obtain the pseudo label of the initial unlabeled sample.

Specifically, the initial unlabeled sample is input into a third user identification model for identification processing, and a pseudo label of the initial unlabeled sample is obtained. The feature screening in this application requires labels, and therefore requires that a pseudo label be added to the initial unlabeled sample first.

Step S20233, feature screening is carried out on the initial labeled sample and the initial unlabeled sample with the pseudo label through a random forest to obtain a labeled sample and an unlabeled sample, and the screened features are determined as target features.

Specifically, the feature contribution degree of each feature is calculated through a random forest. The feature contribution degree measures the importance of a feature. A preset number of features is selected according to the feature contribution degrees, and the data of the features that were not selected is deleted from the initial labeled samples and the initial unlabeled samples to obtain the labeled samples and the unlabeled samples.

In this embodiment, a pseudo label is added to the initial unlabeled sample so that important features can be screened, which yields the labeled samples and unlabeled samples and ensures that model training can proceed smoothly.
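Selecting a preset number of features by contribution degree can be sketched as follows. The function name `select_features`, the dictionary-based sample layout and the example feature names are hypothetical.

```python
def select_features(contributions, samples, top_n):
    """Keep the top_n features with the highest contribution in every sample.

    contributions: dict feature name -> feature contribution degree.
    samples: list of dicts feature name -> value.
    Returns (target_features, screened_samples).
    """
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    target = ranked[:top_n]
    # Drop the data of every feature that was not selected.
    screened = [{name: s[name] for name in target} for s in samples]
    return target, screened


contributions = {"settle_price": 0.1, "device_hits": 0.5, "ip_hits": 0.3}
samples = [{"settle_price": 9.9, "device_hits": 12, "ip_hits": 4}]
target_features, screened = select_features(contributions, samples, top_n=2)
```

The returned `target_features` list is what would later be applied to a user sample to be identified before it is passed to the model.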

Further, the step S20233 may include: taking the initial labeled sample and the initial unlabeled sample with the pseudo label as samples to be screened, and carrying out repeated random sampling to obtain a plurality of characteristic screening training sets; screening a training set based on a plurality of characteristics, and generating a plurality of decision trees to obtain a random forest; calculating a first out-of-bag data error of each decision tree in the random forest according to the out-of-bag data, wherein the out-of-bag data is from a feature screening training set corresponding to each decision tree; randomly changing the characteristics in the data outside the bag, and calculating the second data outside the bag error of each decision tree; calculating the characteristic contribution degree of each characteristic according to the calculated second out-of-bag data error and the first out-of-bag data error; and performing feature screening on the initial labeled sample and the initial unlabeled sample with the pseudo label according to the calculated feature contribution degree to obtain a labeled sample and an unlabeled sample, and determining the screened features as target features.

Specifically, the initial labeled samples and the initial unlabeled samples with pseudo labels are used as the samples to be screened and are randomly sampled with replacement multiple times; after each sampling, the features of the samples may also be randomly sampled, yielding a plurality of feature screening training sets. In one embodiment, the random sampling with replacement of the samples to be screened may be bootstrap sampling. Bootstrap sampling draws from the original sample with replacement; each draw produces a new sample, and repeating the operation multiple times produces multiple new samples that together can represent the sample distribution of the original sample.
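The sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `bootstrap_training_sets`, its parameters, and the dictionary layout are hypothetical. Each training set records the row indices drawn with replacement, the rows never drawn (the out-of-bag rows), and a random subset of feature indices.

```python
import random


def bootstrap_training_sets(num_samples, num_features, features_per_set, num_sets, seed=0):
    """Build `num_sets` feature screening training sets by bootstrap sampling.

    All names here are hypothetical. Rows are drawn with replacement, so some
    rows repeat and others are never drawn; the latter are the out-of-bag rows.
    """
    rng = random.Random(seed)
    training_sets = []
    for _ in range(num_sets):
        # Draw as many row indices as there are samples, with replacement.
        in_bag = [rng.randrange(num_samples) for _ in range(num_samples)]
        drawn = set(in_bag)
        # Rows that were never drawn do not take part in building the tree.
        out_of_bag = [i for i in range(num_samples) if i not in drawn]
        # Optionally sample the features as well (without replacement).
        features = sorted(rng.sample(range(num_features), features_per_set))
        training_sets.append(
            {"in_bag": in_bag, "out_of_bag": out_of_bag, "features": features}
        )
    return training_sets


example_sets = bootstrap_training_sets(
    num_samples=20, num_features=6, features_per_set=3, num_sets=5
)
```

Each entry of `example_sets` would then be used to grow one decision tree, with its `out_of_bag` rows held back for error estimation.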

A decision tree is generated from each feature screening training set, and the K generated decision trees form the random forest. When each decision tree is generated, it is split fully according to the information gain, the information gain ratio, or the Gini index.
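As an illustration of one of the splitting criteria named above, the Gini index of a node and of a candidate split can be computed as shown below. This is a minimal sketch; the function names `gini` and `split_gini` are hypothetical, and the application does not prescribe a particular implementation.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini index of the two child nodes produced by a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (
        len(right_labels) / n
    ) * gini(right_labels)
```

A split with a lower weighted Gini index separates the classes better, so a tree builder would pick the candidate split that minimizes `split_gini`.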

When a decision tree is built from a feature screening training set, some of the samples do not participate in building the decision tree. These samples are the out-of-bag data of the decision tree, which is usually used to evaluate the performance of the decision tree by calculating its prediction error rate, namely the out-of-bag data error.

The out-of-bag data is input into each decision tree, and the out-of-bag data error is calculated from the classification results and the sample labels to obtain the first out-of-bag data errors error₁, error₂, ..., error_K. The feature value of a feature in the out-of-bag data is then randomly changed, the changed data is input into the decision trees, and the out-of-bag data error is calculated again to obtain the second out-of-bag data errors error'₁, error'₂, ..., error'_K. The feature contribution degree of each feature is calculated from the second out-of-bag data errors and the first out-of-bag data errors.

The features are sorted in descending order of feature contribution degree, and a preset number of features are screened out (alternatively, features are rejected in a corresponding proportion to obtain new samples to be screened, and the process is repeated with the new samples to be screened until the final preset number of features is obtained). Data recombination is performed on the initial labeled samples and the initial unlabeled samples according to the screened features, retaining the user data corresponding to the screened features, to obtain the labeled samples and the unlabeled samples; the screened features are determined as the target features.
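The contribution calculation and the subsequent screening can be sketched as follows. The exact formula is not legible in this text, so the sketch assumes the common permutation-importance form, in which the contribution of a feature is the average over the K trees of (second out-of-bag error minus first out-of-bag error). All function names are hypothetical, and each tree is represented simply as a callable that maps a feature row to a predicted label.

```python
import random


def oob_error(tree, rows, labels):
    """Fraction of out-of-bag rows that the tree misclassifies."""
    wrong = sum(1 for row, y in zip(rows, labels) if tree(row) != y)
    return wrong / len(rows)


def feature_contributions(trees, oob_rows, oob_labels, num_features, seed=0):
    """Assumed form: mean over trees of (permuted OOB error - original OOB error).

    trees[i] is a callable row -> label; oob_rows[i] and oob_labels[i] are the
    out-of-bag rows (lists of feature values) and labels of that tree.
    """
    rng = random.Random(seed)
    contributions = []
    for j in range(num_features):
        total = 0.0
        for tree, rows, labels in zip(trees, oob_rows, oob_labels):
            first = oob_error(tree, rows, labels)
            # Randomly change feature j by shuffling its column across the rows.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            second = oob_error(tree, permuted, labels)
            total += second - first
        contributions.append(total / len(trees))
    return contributions


def select_target_features(contributions, preset_number):
    """Indices of the `preset_number` features with the largest contribution."""
    ranked = sorted(
        range(len(contributions)), key=lambda j: contributions[j], reverse=True
    )
    return sorted(ranked[:preset_number])
```

A feature that no tree relies on gets a contribution of zero, because shuffling it leaves every prediction unchanged; features whose shuffling raises the out-of-bag error rank higher and are kept as target features.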

In the embodiment, a random forest is established, the feature contribution degree of each feature is calculated, the features are screened out according to the feature contribution degree, and the labeled sample and the unlabeled sample are obtained, so that the model can carry out targeted training on the important features, and the training efficiency is improved.

Further, the step S204 may include: for each unlabeled sample, determining a neighboring sample set of the unlabeled sample according to the Euclidean distances among the unlabeled samples, wherein the neighboring sample set comprises a preset number of adjacent samples; for each adjacent sample, selecting an extended sample point on the feature space connecting line between the adjacent sample and the unlabeled sample; and constructing an enhanced unlabeled sample set corresponding to the unlabeled sample according to the selected extended sample points and the unlabeled sample.

Specifically, the unlabeled samples can be viewed as points in a feature space whose dimension equals the feature dimension of the unlabeled samples. For each unlabeled sample, the Euclidean distances between that unlabeled sample and the other unlabeled samples are determined and sorted from small to large, and a preset number of the nearest unlabeled samples are selected to obtain the neighboring sample set; each unlabeled sample in the neighboring sample set can be regarded as an adjacent sample of the original unlabeled sample.

A feature space connecting line exists between the unlabeled sample and each adjacent sample, and a preset number of points are randomly selected on this connecting line to obtain the extended sample points:

$$(a_{new}, b_{new}, \ldots, m_{new}) = (a, b, \ldots, m) + \mathrm{rand}(0,1) \cdot (a_n - a,\ b_n - b,\ \ldots,\ m_n - m) \tag{3}$$

where the feature dimension of the unlabeled sample is m, $(a_{new}, b_{new}, \ldots, m_{new})$ are the coordinates of the extended sample point in the feature space, $(a, b, \ldots, m)$ are the coordinates of the unlabeled sample in the feature space, $a_n, b_n, \ldots, m_n$ are the coordinates of the adjacent sample in each dimension of the feature space, and $\mathrm{rand}(0,1)$, a random number between 0 and 1, is an adjusting factor for adjusting the distance from the extended sample point to the unlabeled sample.

After an extended sample point has been selected for each adjacent sample, the extended samples corresponding to the unlabeled sample are obtained from the coordinates of the extended sample points in the feature space. The unlabeled sample and its corresponding extended samples are all treated as enhanced unlabeled samples and are combined into the enhanced unlabeled sample set.

In this embodiment, the adjacent samples of the unlabeled samples are determined in the feature space according to the Euclidean distance, and the extended sample points are generated according to the unlabeled samples and the adjacent samples, so that a plurality of extended samples similar to the unlabeled samples can be generated, and data enhancement is realized.
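The data enhancement of step S204 can be sketched as below. This is a minimal illustration of formula (3) under stated assumptions: the function name `augment_unlabeled` and its parameters are hypothetical, and a single extended sample point is drawn per adjacent sample for brevity.

```python
import math
import random


def augment_unlabeled(samples, index, num_neighbors, seed=0):
    """Build the enhanced unlabeled sample set for samples[index].

    Hypothetical helper: finds the `num_neighbors` nearest unlabeled samples
    by Euclidean distance and interpolates one point toward each of them.
    """
    rng = random.Random(seed)
    x = samples[index]
    others = [s for i, s in enumerate(samples) if i != index]
    # Sort the other unlabeled samples from nearest to farthest.
    others.sort(key=lambda s: math.dist(x, s))
    neighbors = others[:num_neighbors]

    extended = []
    for neighbor in neighbors:
        r = rng.random()  # adjusting factor rand(0,1) in formula (3)
        # Point on the connecting line between x and the adjacent sample.
        extended.append([xi + r * (ni - xi) for xi, ni in zip(x, neighbor)])

    # The original sample and its extended samples form the enhanced set.
    return [list(x)] + extended


unlabeled = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
enhanced_set = augment_unlabeled(unlabeled, index=0, num_neighbors=2)
```

Because every extended point lies between the unlabeled sample and one of its nearest neighbors, the generated samples stay close to the original in the feature space.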

Further, as shown in fig. 5, the step S205 may include:

step S2051, the labeled samples and the enhanced unlabeled sample sets corresponding to the unlabeled samples are input into the second user identification model, so as to obtain the user prediction results of the labeled samples and the user prediction results of each enhanced unlabeled sample in the enhanced unlabeled sample sets.

Specifically, the server inputs the labeled sample and the enhanced unlabeled sample set into a second user identification model to obtain a user prediction result of the labeled sample; the enhanced unlabeled sample set is provided with a plurality of enhanced unlabeled samples, and each enhanced unlabeled sample has a corresponding user prediction result.

Step S2052, determining a user prediction result of the unlabeled exemplar according to the user prediction result of each enhanced unlabeled exemplar.

Specifically, the user prediction results of the enhanced unlabeled samples are tallied by class, and the user prediction result with the highest frequency is taken as the user prediction result of the unlabeled sample corresponding to the enhanced unlabeled sample set.
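The frequency-based selection of step S2052 amounts to a majority vote, which can be sketched as follows. The function name `vote` and the label strings are hypothetical; how ties are broken is not specified in the text, and this sketch simply keeps the first most frequent result.

```python
from collections import Counter


def vote(enhanced_predictions):
    """Return the most frequent prediction among the enhanced unlabeled samples."""
    return Counter(enhanced_predictions).most_common(1)[0][0]


prediction = vote(["abnormal", "normal", "abnormal"])
```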

Step S2053, using the user prediction result of the unlabeled sample in the previous round of the second training as the pseudo label of the unlabeled sample in the current round of the second training, so as to calculate the regularized cross entropy loss of the labeled sample and the unlabeled sample.

Wherein the regularized cross-entropy loss is a loss function of the second user identification model.

Specifically, the second training consists of multiple rounds of training, and each round outputs a user prediction result for each unlabeled sample. In the current round of the second training, the user prediction result of the unlabeled sample from the previous round is used as the pseudo label of the unlabeled sample. The regularized cross entropy loss is calculated by combining the labels and user prediction results of the labeled samples with the pseudo labels and user prediction results of the unlabeled samples.

The loss has two terms. The first term is the cross-entropy loss of the labeled samples, computed from the label of each labeled sample and the user prediction result of that labeled sample, where n is the number of labeled samples. The regularization term is the cross-entropy loss of the unlabeled samples, computed from the pseudo label of each unlabeled sample and the user prediction result of that unlabeled sample, where n' is the number of unlabeled samples, C is the number of sample classes, and α(t) is a time-varying parameter.
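A regularized cross-entropy loss of this kind can be sketched as follows. The formula itself is not legible in this text, so the sketch assumes a common pseudo-label form: the mean cross-entropy over the labeled samples plus α(t) times the mean cross-entropy over the unlabeled samples against their pseudo labels. The function names and the representation of predictions as per-class probability lists are assumptions.

```python
import math


def cross_entropy(targets, probabilities, eps=1e-12):
    """Mean cross-entropy; targets are class indices, probabilities are
    per-sample lists of C class probabilities."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        total -= math.log(max(p[y], eps))  # eps guards against log(0)
    return total / len(targets)


def regularized_loss(labels, labeled_probs, pseudo_labels, unlabeled_probs, alpha_t):
    """Assumed form: supervised loss + alpha(t) * pseudo-label loss."""
    supervised = cross_entropy(labels, labeled_probs)
    regularizer = cross_entropy(pseudo_labels, unlabeled_probs)
    return supervised + alpha_t * regularizer
```

With `alpha_t` equal to zero the loss reduces to the supervised term alone, so the unlabeled samples influence training only as α(t) grows.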

In one embodiment, the time-varying parameter α(t) is determined by the training round t of the second training, where T₁ and T₂ denote training rounds of the second training and α_f is the maximum value of the time-varying parameter. As the time-varying parameter changes, the information that the second user identification model extracts from the unlabeled samples is gradually strengthened; correspondingly, the identification accuracy of the second user identification model gradually improves as training deepens, which ensures the accuracy of the finally obtained abnormal user identification model.
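One schedule consistent with the symbols T₁, T₂, and α_f is a piecewise-linear ramp, sketched below. This specific shape is an assumption (it is a common choice in pseudo-label training); the text only establishes that T₁ and T₂ are training rounds and that α_f is the maximum value. The function name `alpha_schedule` is hypothetical.

```python
def alpha_schedule(t, t1, t2, alpha_f):
    """Assumed ramp: 0 before round t1, linear growth between t1 and t2,
    and the maximum value alpha_f from round t2 onward."""
    if t < t1:
        return 0.0
    if t < t2:
        return alpha_f * (t - t1) / (t2 - t1)
    return alpha_f
```

Under this assumption the unlabeled samples are ignored in early rounds, when pseudo labels are least reliable, and their weight rises as the model improves.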

It follows that in the first round of the second training, the user prediction results of the unlabeled samples are output, but the regularized cross entropy loss is not calculated, because no pseudo labels from a previous round exist yet.

Step S2054, performing parameter adjustment on the second user identification model according to the regularized cross entropy loss until the model converges to obtain an abnormal user identification model.

The server adjusts the model parameters with the goal of minimizing the regularized cross entropy loss until the second user identification model converges, obtaining the abnormal user identification model.

In one embodiment, the user identification model is built based on the LGBM algorithm. LGBM (LightGBM) is an optimized framework implementing the GBDT algorithm; its main idea is to iteratively train weak classifiers (decision trees) to obtain an optimal model. Over multiple iterations, the LGBM traverses each feature and then all possible split points of each feature to find the optimal split point j of the optimal feature m. Each iteration generates a decision-tree-based weak classifier, and each classifier is trained on the residual of the previous classifier. The weak classifiers are required to have low variance and high bias. The LGBM training process continuously improves the accuracy of the final classifier by reducing the bias.
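The residual-fitting idea behind GBDT can be illustrated with the toy sketch below. It is not LightGBM and does not use its API: it is a hand-written booster with one-split trees (stumps) and squared-error residuals, and every name in it is hypothetical. Each round searches all features m and split points j for the best stump and fits it to the residuals left by the previous rounds.

```python
def fit_stump(xs, residuals):
    """Search every feature m and split point j for the stump that best fits
    the residuals in the squared-error sense."""
    best = None
    for m in range(len(xs[0])):
        for j in sorted({x[m] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[m] <= j]
            right = [r for x, r in zip(xs, residuals) if x[m] > j]
            if not left or not right:
                continue  # a valid split needs samples on both sides
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = sum((r - lmean) ** 2 for r in left) + sum(
                (r - rmean) ** 2 for r in right
            )
            if best is None or sse < best[0]:
                best = (sse, m, j, lmean, rmean)
    return best[1:]


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Train stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        m, j, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((m, j, lmean, rmean))
        preds = [
            p + learning_rate * (lmean if x[m] <= j else rmean)
            for x, p in zip(xs, preds)
        ]

    def predict(x):
        return base + sum(
            learning_rate * (l if x[m] <= j else r) for m, j, l, r in stumps
        )

    return predict


model = boost([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.0, 1.0, 1.0])
```

Each stump alone is a high-bias learner; adding them up round by round shrinks the residuals, which is the bias reduction described above.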

In this embodiment, the enhanced unlabeled sample set is input into the second user identification model to obtain a user identification result of the unlabeled sample, the regularized cross entropy loss is calculated in combination with the user identification result of the labeled sample, and the model parameters are adjusted according to the loss, so that the second user identification model is further trained according to the unlabeled sample, and the accuracy of the obtained abnormal user identification model is ensured.

Further, the step S206 may include: obtaining a user sample to be identified; performing feature screening on a user sample to be identified according to a preset target feature; and inputting the user sample to be identified after the characteristic screening into the abnormal user identification model to obtain a user identification result.

In particular, the sample to be identified may be input by the user at the terminal. And determining target characteristics according to the characteristic contribution degree during characteristic screening, and performing characteristic screening on the user sample to be identified according to the target characteristics to remove characteristics except the target characteristics. And inputting the user sample to be identified after the characteristic screening into the abnormal user identification model to obtain a user identification result.

In the embodiment, after the user sample to be identified is obtained, feature screening is performed on the sample according to the preset target feature to obtain the sample with the feature dimension conforming to the model, so that the accuracy of the user identification result is ensured.
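The feature screening applied to a user sample before identification can be sketched as follows. The function name `screen_features`, the dictionary representation of a sample, and the feature names in the example are all hypothetical.

```python
def screen_features(user_sample, target_features):
    """Keep only the target features, in the order the model expects.

    `user_sample` maps feature names to values; features outside
    `target_features` are dropped.
    """
    missing = [f for f in target_features if f not in user_sample]
    if missing:
        raise KeyError(f"sample lacks target features: {missing}")
    return [user_sample[f] for f in target_features]


# Hypothetical sample with one feature that is not a target feature.
sample = {"age": 30, "logins": 5, "device": "x"}
model_input = screen_features(sample, ["logins", "age"])
```

Fixing the order of `target_features` at training time keeps the feature dimension and layout of each incoming sample consistent with what the abnormal user identification model was trained on.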

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

With further reference to fig. 6, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an abnormal user identification apparatus 300 based on intelligent decision, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.

As shown in fig. 6, the abnormal user identification apparatus 300 based on intelligent decision according to the present embodiment includes: a data set acquisition module 301, a data reorganization module 302, a first training module 303, a data enhancement module 304, a second training module 305, and a sample input module 306, wherein:

a data set obtaining module 301, configured to obtain an original data set, where the original data set includes blacklist data, truth-checking user data, and original user data.

The data restructuring module 302 is configured to perform data restructuring on the original data set to obtain a labeled sample and an unlabeled sample.

The first training module 303 is configured to input the labeled sample into the first user identification model, so as to perform first training on the first user identification model through the labeled sample, and obtain a second user identification model.

And the data enhancement module 304 is configured to perform data enhancement on the unlabeled sample to obtain an enhanced unlabeled sample set corresponding to the unlabeled sample.

The second training module 305 is configured to perform second training on the second user identification model through the labeled samples and the enhanced unlabeled sample set corresponding to the unlabeled samples, so as to obtain an abnormal user identification model.

And the sample input module 306 is used for inputting the user sample to be identified into the abnormal user identification model to obtain a user identification result.

In this embodiment, after an original data set is obtained, the data is recombined through data comparison to obtain labeled samples and unlabeled samples. The labeled samples are input into a first user identification model for first training, yielding a second user identification model that already has some ability to identify abnormal users. Data enhancement is performed on the unlabeled samples to obtain enhanced unlabeled sample sets, so that a plurality of similar unlabeled samples are predicted instead of a single original unlabeled sample, which improves the generalization capability of the second user identification model. The second user identification model is then trained on both the labeled samples and the enhanced unlabeled sample sets, so that the model further extracts information from the unlabeled samples for learning, and the abnormal user identification model is finally obtained.

In some optional implementations of this embodiment, the data reorganization module 302 includes: data comparison submodule, data filling submodule and characteristic screening submodule, wherein:

and the data comparison submodule is used for respectively performing data comparison on the blacklist data and the authenticity checking user data and the original user data so as to determine a labeled user list and an initial unlabeled sample.

And the data filling submodule is used for performing data filling on the labeled user list according to the original data set to obtain an initial labeled sample.

And the characteristic screening submodule is used for carrying out characteristic screening on the initial labeled sample and the initial unlabeled sample to obtain a labeled sample and an unlabeled sample.

In some optional implementations of this embodiment, the feature filtering sub-module includes: training unit, input unit and screening unit, wherein:

and the training unit is used for inputting the initial labeled sample into the first user identification model so as to carry out third training on the first user identification model through the initial labeled sample to obtain a third user identification model.

And the input unit is used for inputting the initial unlabeled sample into the third user identification model to obtain the pseudo label of the initial unlabeled sample.

And the screening unit is used for carrying out characteristic screening on the initial labeled sample and the initial unlabeled sample with the pseudo label through a random forest to obtain a labeled sample and an unlabeled sample, and determining the screened characteristics as target characteristics.

In some optional implementations of this embodiment, the screening unit includes: the device comprises a sampling subunit, a generating subunit, a first calculating subunit, a second calculating subunit, a contribution calculating subunit and a feature screening subunit, wherein:

and the sampling subunit is used for taking the initial labeled sample and the initial unlabeled sample with the pseudo label as samples to be screened and performing random sampling with replacement multiple times to obtain a plurality of feature screening training sets.

And the generating subunit is used for generating a plurality of decision trees based on the plurality of feature screening training sets to obtain the random forest.

And the first calculating subunit is used for calculating first out-of-bag data errors of each decision tree in the random forest according to the out-of-bag data, wherein the out-of-bag data come from the feature screening training set corresponding to each decision tree.

And the second calculating subunit is used for randomly changing the characteristics in the out-of-bag data and calculating a second out-of-bag data error of each decision tree.

And the contribution calculating subunit is used for calculating the characteristic contribution degree of each characteristic according to the calculated second out-of-bag data error and the first out-of-bag data error.

And the characteristic screening subunit is used for carrying out characteristic screening on the initial labeled sample and the initial unlabeled sample with the pseudo label according to the calculated characteristic contribution degree to obtain a labeled sample and an unlabeled sample, and determining the screened characteristics as target characteristics.

In some optional implementations of this embodiment, the data enhancement module 304 includes: a sample determining submodule, a sample point selecting submodule, and a sample set constructing submodule, wherein:

and the sample determining submodule is used for determining an adjacent sample set of the unlabeled samples according to the Euclidean distance between the unlabeled samples for each unlabeled sample, wherein the adjacent sample set comprises a preset number of adjacent samples.

And the sample point selection submodule is used for selecting an expanded sample point on a characteristic space connecting line of the adjacent sample and the unlabeled sample for each adjacent sample.

And the sample set construction submodule is used for constructing and obtaining an enhanced non-tag sample set corresponding to the non-tag sample according to the selected extended sample point and the non-tag sample.

In some optional implementations of this embodiment, the second training module 305 includes: a sample input submodule, a result determination submodule, a loss calculation submodule, and a parameter adjustment submodule, wherein:

and the sample input submodule is used for inputting the labeled samples and the enhanced unlabeled sample set corresponding to the unlabeled samples into the second user identification model to obtain the user prediction results of the labeled samples and the user prediction results of the enhanced unlabeled samples in the enhanced unlabeled sample set.

And the result determining submodule is used for determining the user prediction result of the unlabeled sample according to the user prediction result of each enhanced unlabeled sample.

And the loss calculation submodule is used for taking the user prediction result of the unlabeled sample in the previous round of the second training as the pseudo label of the unlabeled sample in the current round of the second training, so as to calculate the regularized cross entropy loss of the labeled sample and the unlabeled sample.

And the parameter adjusting submodule is used for adjusting the parameters of the second user identification model according to the regularized cross entropy loss until the model converges to obtain an abnormal user identification model.

In some optional implementations of this embodiment, the sample input module 306 includes: a sample acquisition submodule, a screening submodule, and a recognition input submodule, wherein:

and the sample acquisition submodule is used for acquiring a user sample to be identified.

And the screening submodule is used for screening the characteristics of the user sample to be identified according to the preset target characteristics.

And the recognition input submodule is used for inputting the user sample to be recognized after the characteristics are screened into the abnormal user recognition model to obtain a user recognition result.

In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 7, fig. 7 is a block diagram of a basic structure of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having components 41-43 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as computer readable instructions of an abnormal user identification method based on intelligent decision-making. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, for example, execute computer readable instructions of the intelligent decision-based abnormal user identification method.

The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.

The computer device provided in this embodiment may perform the steps of the above abnormal user identification method based on intelligent decision. Here, the steps of the intelligent decision-based abnormal user identification method may be steps in the intelligent decision-based abnormal user identification method of the above embodiments.

The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the intelligent decision-based abnormal user identification method as described above.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

It is to be understood that the above-described embodiments are merely some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are within the protection scope of the present application.

Claims

1. An abnormal user identification method based on intelligent decision is characterized by comprising the following steps:

2. The method for identifying abnormal users based on intelligent decision as claimed in claim 1, wherein the step of performing data reorganization on the original data set to obtain labeled samples and unlabeled samples comprises:

3. The abnormal user identification method based on intelligent decision as claimed in claim 1, wherein the step of performing feature screening on the initial labeled sample and the initial unlabeled sample to obtain a labeled sample and an unlabeled sample specifically comprises:

4. An intelligent decision-based abnormal user identification method according to claim 3, wherein the step of performing feature screening on the initial labeled sample and the initial unlabeled sample with the pseudo label through a random forest to obtain a labeled sample and an unlabeled sample, and determining the screened features as target features comprises:

5. An intelligent decision-based abnormal user identification method according to claim 1, wherein the step of performing data enhancement on the unlabeled exemplars to obtain an enhanced unlabeled exemplar set corresponding to the unlabeled exemplars comprises:

6. An intelligent decision-based abnormal user identification method according to claim 1, wherein the step of performing a second training on the second user identification model through the labeled samples and the enhanced unlabeled sample set corresponding to the unlabeled samples to obtain an abnormal user identification model comprises:

7. An intelligent decision-making based abnormal user identification method according to claim 3, wherein the step of inputting the sample of users to be identified into the abnormal user identification model to obtain the user identification result comprises:

obtaining a user sample to be identified;

8. An abnormal user identification device based on intelligent decision, which is characterized by comprising:

9. A computer device comprising a memory having computer readable instructions stored therein and a processor which, when executing the computer readable instructions, implements the steps of the intelligent decision-making based abnormal user identification method of any one of claims 1 to 7.

10. A computer-readable storage medium, having computer-readable instructions stored thereon, which, when executed by a processor, implement the steps of the intelligent decision-based abnormal user identification method according to any one of claims 1 to 7.