CN112905987A - Account identification method, account identification device, server and storage medium - Google Patents


Info

Publication number
CN112905987A
Authority
CN
China
Prior art keywords
account
features
category
sample
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911136455.5A
Other languages
Chinese (zh)
Other versions
CN112905987B (en)
Inventor
郗剑亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201911136455.5A (patent CN112905987B)
Publication of CN112905987A
Application granted
Publication of CN112905987B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/45 Structures or tools for the administration of authentication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to an account identification method, an account identification device, a server and a storage medium. In the method, accounts are first categorized to determine their overall characteristics. The category features generated during categorization are then fused with the account features of higher importance to obtain features that comprehensively represent each account, and classification is performed on the fused features, so that an account is identified using the learned characteristics of malicious accounts. This can greatly improve the accuracy of malicious account identification. The process combines supervised and unsupervised learning in a cascade: the unsupervised stage derives features of the category to which the account belongs, and the supervised stage then classifies the result of the unsupervised stage again, achieving a more accurate division.

Description

Account identification method, account identification device, server and storage medium
Technical Field
The present disclosure relates to the field of network technologies, and in particular, to an account identification method, apparatus, server, and storage medium.
Background
In many internet application scenarios, such as e-commerce, virtual social networking, financial services, and video websites, some people maliciously register many accounts with false information in order to obtain improper benefits, and use those accounts for illegal acts such as malicious billing and fraud. Such accounts therefore need to be identified to protect the interests of users, merchants, and operators.
At present, account identification is generally implemented by setting identification rules: an account whose account information does not conform to the rules is identified as a maliciously registered account. Although this approach has advantages such as easy configuration, it is easily bypassed, which results in low identification accuracy.
Disclosure of Invention
The disclosure provides an account identification method, an account identification device, a server and a storage medium, which are used for at least solving the problem of low identification accuracy in the related art. The technical scheme of the disclosure is as follows:
In a first aspect, an account identification method is provided, including:
acquiring a first account feature of an account to be identified; determining a first category of the account and a category feature of the account based on the first account feature, wherein the category feature represents a relationship between the account and the first category; performing feature fusion on the category feature and the first account feature to obtain a second account feature of the account; and inputting the second account feature into a target classification model, which predicts whether the account is an account of a target type, to obtain an identification result for the account.
In one possible implementation, acquiring the first account feature of the account to be identified includes: splicing the user profile feature, the login feature, and the user behavior feature of the account to obtain the first account feature of the account.
In one possible implementation, splicing the user profile feature, the login feature, and the user behavior feature of the account to obtain the first account feature includes: encoding each feature separately to obtain encoded features, and splicing the encoded features to obtain the first account feature of the account.
In one possible implementation, the method further includes: when encoding a target feature, segmenting the target feature into multiple sub-features, encoding the sub-features separately, and splicing the encoding results to obtain the encoded target feature.
In one possible implementation, determining the first category of the account and the category feature of the account based on the first account feature includes: inputting the first account feature into a clustering model, which obtains the first category and the category feature of the account according to the distance relationship between the first account feature and a plurality of clusters.
In one possible implementation, the method includes: performing parallel computation on a GPU when determining the category based on the first account feature.
In one possible implementation, before inputting the second account feature of the account into the target classification model, the method further includes: acquiring first sample account features of a plurality of sample accounts; determining categories of the plurality of sample accounts and category features of the plurality of sample accounts based on the first sample account features, wherein the category feature of a sample account represents the relationship between that sample account and its category; performing feature fusion on the category features of the sample accounts, the first sample account features, and the label information of the sample accounts to obtain second sample account features of the sample accounts; and training with the second sample account features of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the method further includes: calculating weights of the input sample features of the plurality of sample accounts through a tree model, deleting the features whose weight is smaller than a target weight, and taking the remaining features as the first sample account features of the plurality of sample accounts.
In a second aspect, an account identification apparatus is provided, including:
an acquisition unit configured to acquire a first account feature of an account to be identified;
a determining unit configured to determine a first category of the account and a category feature of the account based on the first account feature, wherein the category feature represents a relationship between the account and the first category;
a feature fusion unit configured to perform feature fusion on the category feature and the first account feature to obtain a second account feature of the account; and
an identification unit configured to input the second account feature of the account into a target classification model, and to predict through the target classification model whether the account is an account of a target type, to obtain an identification result for the account.
In one possible implementation, the acquisition unit is configured to splice the user profile feature, the login feature, and the user behavior feature of the account to obtain the first account feature of the account.
In one possible implementation, the acquisition unit is configured to encode each feature separately to obtain encoded features, and to splice the encoded features to obtain the first account feature of the account.
In one possible implementation, the acquisition unit is configured to, when encoding a target feature, segment the target feature into multiple sub-features, encode the sub-features separately, and splice the encoding results to obtain the encoded target feature.
In one possible implementation, the determining unit is configured to input the first account feature into a clustering model, which obtains the first category of the account and the category feature of the account according to the distance relationship between the first account feature and a plurality of clusters.
In one possible implementation, the determining unit is configured to perform parallel computation on a GPU when determining the category based on the first account feature.
In one possible implementation, the apparatus further includes: a model training unit configured to perform:
acquiring first sample account features of a plurality of sample accounts; determining categories of the plurality of sample accounts and category features of the plurality of sample accounts based on first target sample features among the first sample account features, wherein the category feature of a sample account represents the relationship between that sample account and its category; performing feature fusion on the category features of the plurality of sample accounts, second target sample features among the first sample account features, and the label information of the plurality of sample accounts to obtain second sample account features of the plurality of sample accounts, wherein the weight of the second target sample features is greater than or equal to the target weight; and training with the second sample account features of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the apparatus further includes:
a feature processing unit configured to calculate weights of the input sample features of the plurality of sample accounts through a tree model, delete the features whose weight is smaller than the target weight, and take the remaining features as the first sample account features of the plurality of sample accounts.
According to a third aspect of the embodiments of the present disclosure, there is provided a server, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the account identification method as in any one of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of a server, enable the server to perform an account identification method as in any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising executable instructions that, when executed by a processor of a server, enable the server to perform the account identification method of any one of the above.
The technical solution provided by the embodiments of the present disclosure brings at least the following beneficial effects. Accounts are first categorized to determine their overall characteristics. The category features generated during categorization are then fused with the account features of higher importance to obtain features that comprehensively represent each account, and classification is performed on the fused features, so that an account is identified using the learned characteristics of malicious accounts. This can greatly improve the accuracy and recall of malicious account identification. The process combines supervised and unsupervised learning in a cascade: the unsupervised stage derives features of the category to which the account belongs, and the supervised stage then classifies the result of the unsupervised stage again, achieving a more accurate division.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flowchart illustrating an account identification method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an account identification method according to an exemplary embodiment.
Fig. 3 is a diagram illustrating a number of different technical processes involved in an account identification process, according to an example embodiment.
Fig. 4 is a block diagram illustrating an account identification apparatus according to an example embodiment.
Fig. 5 is a block diagram illustrating a server in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The user information involved in the present disclosure may be information authorized by the user or fully authorized by all parties.
Fig. 1 is a flowchart illustrating an account identification method according to an exemplary embodiment. As shown in Fig. 1, the method is used in a server and includes the following steps.
In step 101, a first account characteristic of an account to be identified is obtained.
In step 102, a first category of the account and a category characteristic of the account are determined based on the first account characteristic, where the category characteristic of the account is used to represent a relationship between the account and the first category.
In step 103, the category characteristic of the account and the first account characteristic are subjected to characteristic fusion to obtain a second account characteristic of the account.
In step 104, the second account characteristics of the account are input into a target classification model, and the target classification model predicts whether the account is an account of the target type, to obtain an identification result for the account.
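Steps 101 to 104 can be read as a simple pipeline. The following is a minimal sketch of that flow, not the patented implementation: the function name `identify_account` and the two stand-in callables for the clustering model and the target classification model are hypothetical and exist only for illustration.

```python
def identify_account(first_feature, cluster_fn, classify_fn):
    """Run steps 102-104 on an already acquired first account feature (step 101)."""
    category_feature = cluster_fn(first_feature)         # step 102: category feature
    second_feature = first_feature + category_feature    # step 103: feature fusion
    return classify_fn(second_feature)                   # step 104: prediction

# Hypothetical stand-ins (assumptions): a "clustering model" that returns a
# cluster number and a distance, and a "classifier" that flags cluster 1.
toy_cluster = lambda f: [1, 0.5] if sum(f) > 10 else [0, 0.2]
toy_classifier = lambda f: f[-2] == 1

suspicious = identify_account([9.0, 4.0], toy_cluster, toy_classifier)
ordinary = identify_account([1.0, 2.0], toy_cluster, toy_classifier)
```

Passing the two models in as callables keeps the sketch independent of any particular clustering or classification algorithm.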
In the method provided by the embodiments of the present disclosure, accounts are first categorized to determine their overall characteristics. The category features generated during categorization are then fused with the account features of higher importance to obtain features that comprehensively represent each account, and classification is performed on the fused features, so that an account is identified using the learned characteristics of malicious accounts. This can greatly improve the accuracy of malicious account identification. The process combines supervised and unsupervised learning in a cascade: the unsupervised stage derives features of the category to which the account belongs, and the supervised stage then classifies the result of the unsupervised stage again, achieving a more accurate division.
In one possible implementation, acquiring the first account feature of the account to be identified includes: splicing the user profile feature, the login feature, and the user behavior feature of the account to obtain the first account feature of the account.
In one possible implementation, splicing the user profile feature, the login feature, and the user behavior feature of the account to obtain the first account feature includes: encoding each feature separately to obtain encoded features, and splicing the encoded features to obtain the first account feature of the account.
In one possible implementation, the method further includes:
when encoding a target feature, segmenting the target feature into multiple sub-features, encoding the sub-features separately, and splicing the encoding results to obtain the encoded target feature.
In one possible implementation, determining the first category of the account and the category feature of the account based on the first account feature includes:
inputting the first account feature into a clustering model, which obtains the first category of the account and the category feature of the account according to the distance relationship between the first account feature and a plurality of clusters.
In one possible implementation, the method includes: performing parallel computation on a GPU when determining the category based on the first account feature.
In one possible implementation, before inputting the second account feature of the account into the target classification model, the method further includes:
acquiring first sample account features of a plurality of sample accounts;
determining categories of the plurality of sample accounts and category features of the plurality of sample accounts based on the first sample account features, wherein the category feature of a sample account represents the relationship between that sample account and its category;
performing feature fusion on the category features of the sample accounts, the first sample account features, and the label information of the sample accounts to obtain second sample account features of the sample accounts;
and training with the second sample account features of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the method further includes: calculating weights of the input sample features of the plurality of sample accounts through a tree model, deleting the features whose weight is smaller than a target weight, and splicing the remaining features into the first sample account features of the plurality of sample accounts.
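The weight-based feature selection can be sketched as a simple filter. This is an illustration under stated assumptions: the weights are taken as already computed (the text says a tree model produces them, but does not say how), and the feature names, weight values, and helper names `select_features` and `project` are hypothetical.

```python
def select_features(weights, target_weight):
    """Keep the names of features whose weight is at least target_weight."""
    return [name for name, w in weights.items() if w >= target_weight]

def project(sample, kept):
    """Restrict one sample (feature name -> value) to the kept features."""
    return {name: sample[name] for name in kept}

# Hypothetical weights, as a tree model might report them (assumed values).
weights = {"login_ip": 0.41, "device_model": 0.33, "gender": 0.02, "age": 0.24}
kept = select_features(weights, target_weight=0.05)

sample = {"login_ip": "10.0.0.1", "device_model": "X1", "gender": "m", "age": 30}
reduced = project(sample, kept)
```

Features whose weight falls below the target weight ("gender" here) are deleted, and the remaining features form the first sample account features.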
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which are not described in detail here.
Fig. 2 is a flowchart illustrating an account identification method according to an exemplary embodiment. As shown in Fig. 2, the method is used in a server and includes the following steps.
In step 201, the server obtains the user profile characteristics, login characteristics and user behavior characteristics of the account to be identified.
When identifying an account, the server may obtain basic data of the account to be identified. The basic data may include user profile features, which may be obtained from a user profile database and may be the profile information filled in at registration or the information obtained after the user updates the profile, such as the user's gender, age, and region; this is not limited in the embodiments of the present disclosure. The basic data may also include front-end and back-end features of the user, for example the login device model, login system version, and login IP address, which may be collectively referred to as login features. In addition, the basic data may include user behavior features, which represent the user's behavior in the course of a login session, such as clicking, watching, and following. Together these features can cover almost all aspects of the account and give a deeper description of the user.
In one possible implementation, the server may further obtain a user portrait of the account. The user portrait may be generated from the account's user profile features, historical behavior features, and the like. Combining the user portrait with the features above yields feature information that covers the account comprehensively and avoids omitting information.
In step 202, the server encodes each feature to obtain encoded features.
The features obtained above may be encoded so that they are expressed in a uniform format. Features of different types may be encoded with the encoding scheme corresponding to each type. For example, a feature may be encoded as a fixed-length vector: for gender, one gender value may be encoded as (0,0) and male as (0,1). The vector representations of different types of features need not have the same length, which is not limited in the embodiments of the present disclosure.
Some of the above features are continuous and some are discrete. Discrete features may be simplified during encoding to reduce the amount of computation and improve efficiency. For example, folding encoding may be applied to a target feature: the target feature is segmented into multiple sub-features, the sub-features are encoded separately, and the encoding results are spliced to obtain the encoded target feature.
Taking a login IP address as the target feature, the address can be divided into 4 segments and each segment encoded independently. The number of encoding bits is thereby shortened from 255^4 to 255 x 4, a compression ratio of about 4 million.
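The segmented encoding of a login IP address and the arithmetic behind the compression ratio can be sketched as follows. This is an illustrative sketch, not the patented encoder: the function name `encode_ip_segmented` is hypothetical, one-hot encoding per segment is an assumption, and the sketch allocates 256 slots per segment (values 0 to 255) whereas the arithmetic in the text uses 255.

```python
def encode_ip_segmented(ip):
    """One-hot encode each of the four segments separately, then splice them."""
    encoded = []
    for part in ip.split("."):
        segment = [0] * 256          # one slot per possible segment value 0..255
        segment[int(part)] = 1
        encoded.extend(segment)
    return encoded

vector = encode_ip_segmented("192.168.0.1")

# Dimension counts using the figures quoted in the text (255 per segment).
joint_dims = 255 ** 4        # one slot per complete address
segmented_dims = 255 * 4     # one block of slots per segment
ratio = joint_dims / segmented_dims
```

With the text's figures, `joint_dims` is 4,228,250,625 and `segmented_dims` is 1,020, so the ratio is a little over 4.1 million.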
In some possible implementations, one-hot encoding may be used. It is suitable when the feature dimension is not too high, and because it is simple, it can greatly reduce the time spent on encoding.
In step 203, the server splices the encoded features to obtain a first account feature of the account.
To express the features of one account, the encoded features may be spliced together to serve as a description of the account. With the vector representation described above, the feature vectors obtained by encoding the individual features may be spliced in a predetermined order.
It should be noted that some accounts may lack certain features. For such accounts, the missing features may be filled in: for example, each missing feature is mapped to a preset vector corresponding to that feature, and the preset vector is used as its feature vector in the splicing step.
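Splicing in a predetermined order with preset vectors for missing features can be sketched as below. The feature names, the preset vectors in `DEFAULTS`, and the order in `ORDER` are hypothetical values chosen for illustration; the text does not specify them.

```python
# Hypothetical preset vectors used when a feature is missing (assumed values).
DEFAULTS = {"gender": [1, 1], "os": [0, 0, 0], "age": [0.0]}
ORDER = ["gender", "os", "age"]   # the predetermined splicing order

def splice(encoded_features):
    """Concatenate encoded features in ORDER, filling gaps from DEFAULTS."""
    result = []
    for name in ORDER:
        result.extend(encoded_features.get(name, DEFAULTS[name]))
    return result

# The "os" feature is missing for this account, so its preset vector is used.
first_feature = splice({"gender": [0, 1], "age": [27.0]})
```

Because every account is spliced in the same order and every feature has a fixed length, the resulting vectors are comparable across accounts even when some features are absent.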
For example, categorical features (e.g., user gender, mobile phone system version) and numerical features (e.g., user registration timestamp, usage duration) may be encoded and then spliced into one feature vector such as: (1000,[2,3,9,100,999],[3.0,1.0,0.0,101.25,0.1]).
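The example vector resembles a sparse notation of the form (size, indices, values). That reading is an assumption, since the text does not define the format. Under that assumption, a minimal sketch of expanding it to a dense vector (the helper name `to_dense` is hypothetical) is:

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

dense = to_dense(1000, [2, 3, 9, 100, 999], [3.0, 1.0, 0.0, 101.25, 0.1])
```

Only five of the 1000 positions are stored explicitly, which is why a sparse form suits spliced one-hot encodings.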
In step 204, the server inputs the first account characteristic into a clustering model, and obtains a first category of the account and a category characteristic of the account according to a distance relationship between the first account characteristic and a plurality of clusters through the clustering model.
When clustering is performed by the clustering model, the first account feature may be assigned, based on its distance to the cluster center of each cluster, to the cluster whose center is closest. The number of that cluster and the distance between the first account feature and the cluster center are output to represent the relationship between the account and the category corresponding to the cluster.
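The nearest-center assignment described above can be sketched as follows. The use of Euclidean distance, the function name `assign_cluster`, and the toy cluster centers are assumptions; the text only speaks of a "distance relationship".

```python
import math

def assign_cluster(feature, centers):
    """Return (cluster number, distance to that cluster's center)."""
    best_number, best_distance = None, math.inf
    for number, center in enumerate(centers):
        distance = math.dist(feature, center)   # Euclidean distance (assumed)
        if distance < best_distance:
            best_number, best_distance = number, distance
    return best_number, best_distance

centers = [[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]]   # hypothetical cluster centers
number, distance = assign_cluster([3.0, 0.0], centers)
```

The returned pair corresponds to the two outputs named in the text: the number of the cluster and the distance to its center.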
Step 204 is the process of determining the first category of the account and the category feature of the account based on the first account feature. In the embodiments of the present disclosure, the clustering model may be implemented with the KMeans++ algorithm, which is simple to implement and converges quickly. In some possible implementations, the server may also use a GPU for parallel computation during clustering, so that data output is completed on time in an online setting.
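The text names the algorithm but gives no details. As a reminder of what distinguishes k-means++ from plain k-means, the sketch below shows only the standard seeding step, in which each new center is sampled with probability proportional to its squared distance from the nearest center already chosen. The function name `kmeans_pp_init` and the toy points are hypothetical, and nothing here is specific to the patent or to GPU execution.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers using k-means++ seeding."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Sample the next center with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
```

Points that are already centers have zero weight and are not drawn again, so the seeds tend to be spread across the data before the usual k-means iterations begin.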
In step 205, the server performs feature fusion on the category feature of the account and the first account feature to obtain a second account feature of the account.
It should be noted that the category feature obtained by clustering may be encoded before feature fusion, and the encoded category feature is then used for fusion. For example, the encoded category feature may be [2001,105,100]; splicing it with the first account feature (1000,[2,3,9,100,999],[3.0,1.0,0.0,101.25,0.1]) from the example above gives:
(1000,[2,3,9,100,999],[3.0,1.0,0.0,101.25,0.1],[2001,105,100]).
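The splicing in this example can be reproduced literally. The sketch below treats the first account feature as a tuple and appends the encoded category feature as one more element; it does not interpret what the individual numbers mean, and the function name `fuse` is hypothetical.

```python
def fuse(first_account_feature, category_feature):
    """Append the encoded category feature to the first account feature."""
    return tuple(first_account_feature) + (list(category_feature),)

first = (1000, [2, 3, 9, 100, 999], [3.0, 1.0, 0.0, 101.25, 0.1])
category = [2001, 105, 100]

second = fuse(first, category)
```

The result has the same shape as the spliced feature shown in the text.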
in step 206, the server inputs the second account characteristics of the account into a target classification model, and predicts whether the account is a target type account through the target classification model to obtain the identification result of the account.
The target classification model may be a model that has learned the characteristics of malicious accounts. The training data may include features obtained in the same manner as described above, with label information added to enable supervised learning, so that the model can identify whether an account is a malicious account. The specific training process is described in detail below.
The target classification model is used to predict whether the account is a target type account. The prediction result may be a binary classification result, that is, a yes or no output; a multi-class result may also be used to broaden the scenarios to which the model applies. The classification algorithm used by the classification model may likewise be chosen from a variety of options, such as random forest (RF), gradient boosting decision tree (GBDT), logistic regression (LR), or neural network (NN), and may be selected according to the service scenario; the embodiments of the present disclosure do not limit which algorithm is adopted.
According to the method provided by the embodiments of the present disclosure, the account is first clustered to determine some of its overall characteristics. The category features generated during clustering are then fused with features of higher importance to obtain features that comprehensively represent the account, and classification is performed on the fused features, so that the account is identified using the learned characteristics of malicious accounts, which can greatly improve the accuracy of malicious account identification. This process combines supervised and unsupervised approaches into a cascaded method: the unsupervised stage is used to obtain the features of the category to which the account belongs, and the supervised stage then classifies the result of the unsupervised stage again, achieving accurate division. Furthermore, when encoding discrete high-dimensional features, an encoding scheme that shortens the number of encoding bits is adopted, which retains as much effective information as possible in as few dimensions as possible, greatly reduces the amount of computation, saves computing resources, and allows a result to be returned more quickly. Furthermore, features are selected for clustering based on their importance, which prevents ineffective features from interfering with the identification process.
Further, different identification results may trigger the server to perform corresponding subsequent verification or penalties. For example, if the identification result falls within a first target value interval, its accuracy may be low, and an administrator may be prompted to perform manual verification; if the identification result falls within a second target value interval, the account is determined to be a malicious account, and the server may be triggered to automatically ban the account, and so on, which greatly improves the efficiency of handling malicious accounts.
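The interval-based follow-up handling can be sketched as below. The source does not give concrete interval bounds or say that the identification result is a score between 0 and 1; both are assumptions here, as are the names `MANUAL_REVIEW_INTERVAL`, `AUTO_BAN_INTERVAL`, and `route`.

```python
# Hypothetical score intervals; the source does not give concrete bounds.
MANUAL_REVIEW_INTERVAL = (0.5, 0.9)  # stands in for the first target value interval
AUTO_BAN_INTERVAL = (0.9, 1.0)       # stands in for the second target value interval

def route(score):
    """Map an identification score to the follow-up action it triggers."""
    low, high = AUTO_BAN_INTERVAL
    if low <= score <= high:
        return "auto_ban"        # treated as a malicious account
    low, high = MANUAL_REVIEW_INTERVAL
    if low <= score < high:
        return "manual_review"   # result may be inaccurate, so a person checks it
    return "no_action"

for score in (0.95, 0.6, 0.1):
    print(score, route(score))
```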
Observation of online data shows that, after the embodiments of the present disclosure are applied, evaluation indexes such as accuracy and recall increase significantly: calculated as a daily average, the identification accuracy exceeds 96.3% and the recall exceeds 97%.
In one possible implementation, the training process of the target classification model is roughly divided into basic data processing, feature selection, data aggregation, cluster division, and other stages; see fig. 3. In the basic data processing stage, data of the sample accounts is collected, for example basic account features, user behavior features, and user portraits. This may be implemented by interacting with a database, which may use an architecture such as HDFS (Hadoop Distributed File System) or Hive (a Hadoop-based data warehouse tool); the embodiments of the present disclosure do not limit this. In the feature selection stage, the features may undergo processing such as extraction, encoding, missing-value handling, discretization, and nondimensionalization, which may be carried out with Spark. In the data aggregation stage, the processed features are partitioned by clustering, which may specifically use the KMeans algorithm. Feature weights may also be applied in the cluster division, and the weights may be obtained with the XGBoost algorithm. Finally, the resulting model may be provided as a middle-platform service, so that it can be applied in practical scenarios for account identification, such as group identification or abnormal-account detection. The specific training process is described in detail below:
Step one, obtaining first sample account features of a plurality of sample accounts.
Before training the target classification model, the server may obtain basic data of the sample accounts. The basic data may include user profile features, which may be obtained from a user profile database; they may be the profile information filled in at registration or the information obtained after the user updates the profile, such as the user's gender, age, and region, which is not limited in the embodiments of the present disclosure. The basic data may also include some front-end and back-end features of the user, such as the login device model, login system version, and login IP address, which may be collectively referred to as login features. In addition, the basic data may include user behavior features that represent the user's behavior while logged in, such as clicking, watching, and following. Together these features can cover almost all characteristics of the account and allow a deeper description of the user.
In a possible implementation, the server may further obtain a user portrait of the sample account. The user portrait may be generated based on the user profile features, historical behavior features, and the like of the account. Combining the user portrait with the above features yields feature information that covers the account comprehensively and avoids omitting information.
It should be noted that the sample account includes a positive sample and a negative sample, the positive sample is a sample account labeled as a non-malicious account, and the negative sample is a sample account labeled as a malicious account.
After collection, a large number of features are available, some of which are not discriminative, and using them directly would introduce noise into the model. Therefore, features whose weights are less than the target weight may be deleted. It should be noted that the weight of each feature may be a preset weight, or may be calculated by a tree model based on the features, which is not limited in the embodiments of the present disclosure. In one possible implementation, the method further comprises: calculating the weights of the input sample features of the plurality of sample accounts through a tree model, deleting the features whose weights are smaller than the target weight, and splicing the remaining features into the first sample account features of the plurality of sample accounts. It should be noted that the target weight may differ between training processes or identification processes and may be determined from the actual samples used for training; for example, the features whose weights rank within a preset number of top positions may be selected, and the minimum weight among the selected features is used as the target weight. The weight calculation may also be implemented with XGBoost, in which case the weight may be the feature importance produced during XGBoost model computation. Because the weight represents the importance of a feature to subsequent identification, deleting features of lower importance prevents them from interfering with identification and improves identification accuracy.
In a possible implementation, a larger weight indicates higher importance. A weight smaller than the target weight indicates that the feature is of low importance; deleting such a feature reduces the computation required for subsequent encoding and for the calculation itself, thereby saving resources.
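The weight-based selection can be sketched as follows, taking the target weight to be the smallest weight among the top-ranked features as described above. The feature names and weight values are invented for illustration, and `select_features` is a hypothetical helper; in practice the weights might come from a tree model's feature importance.

```python
def select_features(weights, top_n):
    """Keep the features whose weight is at least the target weight.

    The target weight is the smallest weight among the top_n ranked features.
    """
    if not 1 <= top_n <= len(weights):
        raise ValueError("top_n must be between 1 and the number of features")
    ranked = sorted(weights.values(), reverse=True)
    target_weight = ranked[top_n - 1]
    # Features with a weight below the target weight are deleted.
    kept = [name for name, w in weights.items() if w >= target_weight]
    return kept, target_weight

# Hypothetical feature weights.
weights = {"login_ip": 0.31, "device_model": 0.22, "age": 0.02,
           "follow_count": 0.40, "region": 0.05}
kept, target_weight = select_features(weights, top_n=3)
print(kept, target_weight)
```

Note that ties at the target weight are all kept, so slightly more than `top_n` features can survive.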
When the features are spliced, the splicing process is the same as that in the account identification process, which can be referred to as step 202, and is not described herein again.
Step two, determining the categories of the sample accounts and the category features of the sample accounts based on the first sample account features, wherein the category features of the sample accounts are used for representing the relationship between the sample accounts and the categories.
Step two may be a clustering process, which may be implemented by a clustering model: the server inputs the plurality of first sample account features into the clustering model, which clusters them with a clustering algorithm. One specific clustering procedure is as follows: (1) according to a given value, select that number of first sample account features as the initial cluster centers; (2) calculate the distance from every first sample account feature to each cluster center, and assign each first sample account feature to its closest cluster center; (3) calculate the mean of the first sample account features in each cluster, and take the mean as the new cluster center; (4) repeat steps (2) and (3) until the maximum number of iterations is reached or the change in the cluster centers is smaller than a predefined threshold, at which point the iteration ends and the clustering result is obtained. After clustering is completed, its outputs, such as the plurality of clusters, the first sample account features in each cluster, and the compactness of the clusters, may be output.
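The clustering procedure above can be sketched in plain Python. This is an illustrative version under stated assumptions: Euclidean distance, initial centers supplied by the caller (for example chosen at random or by KMeans++ seeding), and toy two-dimensional features. The names `kmeans`, `max_iter`, and `tol` are not from the source.

```python
import math

def kmeans(points, initial_centers, max_iter=100, tol=1e-6):
    """Cluster `points` starting from the given initial centers."""
    centers = [list(c) for c in initial_centers]  # step (1): initial cluster centers
    k = len(centers)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step (2): assign every point to its closest cluster center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Step (3): the mean of each cluster becomes its new center.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        # Step (4): stop once the centers move less than the threshold.
        if shift < tol:
            break
    return centers, labels

points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
centers, labels = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)
```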
It should be noted that, before each iteration of the clustering process in the embodiments of the present disclosure, some features whose weights are smaller than the target weight are randomly deleted, and after multiple iterations, the group of features with the smallest Within Set Sum of Squared Errors (WSSSE) is selected as the features used in clustering. That is, in addition to the features whose weights are smaller than the target weight, other features may also be deleted, so that the final feature set is determined through the iteration process. In the account identification process, only these finally determined features need to be obtained when acquiring the features of an account, which avoids wasting data and computing resources.
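The WSSSE criterion can be illustrated with a small sketch: it sums, over all points, the squared distance from each point to the center of its assigned cluster, and the feature group with the smallest value is kept. The clustering runs below are precomputed toy data, and the names `wssse` and `pick_best_run` are hypothetical.

```python
def wssse(points, centers, labels):
    """Within Set Sum of Squared Errors for one clustering result."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )

def pick_best_run(runs):
    """Return the feature names of the run with the smallest WSSSE.

    Each run is a tuple (feature_names, points, centers, labels).
    """
    return min(runs, key=lambda run: wssse(run[1], run[2], run[3]))[0]

# Two hypothetical clustering runs on different feature subsets.
run_a = (("f1", "f2"), [[0.0, 0.0], [2.0, 0.0]], [[1.0, 0.0]], [0, 0])
run_b = (("f1",), [[0.0], [1.0]], [[0.5]], [0, 0])
print(wssse(*run_a[1:]), wssse(*run_b[1:]), pick_best_run([run_a, run_b]))
```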
Step three, performing feature fusion on the category features of the sample accounts, the first sample account features, and the label information of the sample accounts, to obtain second sample account features of the sample accounts.
The label information is used to indicate whether the sample account is a malicious account. The feature fusion is the same as that in step 205 and is not described again here. It should be noted that the feature fusion may further include feature crossing; for example, when any sample has multiple category features, the sum or a combination of those category features may be used as the comprehensive category feature of the sample account, so as to improve the robustness of the model.
Step four, training with the second sample account features of the plurality of sample accounts to obtain the target classification model.
In each iteration of the training process, the second sample account features are input into the model, the model computes on them with its current parameters and outputs an identification result, and the model parameters are adjusted based on the difference between the identification result and the label information. The next iteration is performed with the adjusted parameters, until an iteration stop condition is met, for example the identification accuracy reaching a target accuracy. The model parameters at that point are output as the parameters of the classification model, yielding the target classification model.
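The iterative adjustment described above can be sketched with a small logistic regression trained by stochastic gradient descent. Logistic regression is only one of the algorithms the text allows, and it is chosen here purely for brevity; the learning rate, the toy data, and the convention that label 1 marks a malicious account are assumptions for this example.

```python
import math

def predict(weights, bias, x):
    """Predicted probability that the sample carries label 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(weights, bias, features, labels):
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(y)
               for x, y in zip(features, labels))
    return hits / len(labels)

def train(features, labels, lr=0.5, target_accuracy=1.0, max_iter=1000):
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(max_iter):
        # Iteration stop condition: identification accuracy reaches the target.
        if accuracy(weights, bias, features, labels) >= target_accuracy:
            break
        for x, y in zip(features, labels):
            # Difference between the identification result and the label.
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Toy second sample account features; label 1 = malicious (assumed convention).
features = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
labels = [0, 0, 1, 1]
weights, bias = train(features, labels)
print(accuracy(weights, bias, features, labels))
```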
In the training process, the features used for training may be determined through multiple rounds of clustering iteration; the result of the clustering iteration is combined with some high-weight features, and the model is then iterated continuously using the label information, so that the target classification model can identify malicious accounts.
In a possible implementation, the processes of feature selection, encoding, and model prediction may be packaged into a pipeline, so that the code of the whole account identification method is encapsulated and provided to a business party as a black box: the business party provides the basic data, and model parameters meeting the business requirements are obtained through automatic feature extraction and training on that data.
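The black-box packaging can be sketched as a chain of named steps that a caller runs with a single call. Every stage below is a deliberately trivial stand-in: the `Pipeline` class, the kept feature names, and the threshold rule in `predict` are hypothetical and only illustrate the structure.

```python
class Pipeline:
    """Run a fixed sequence of named steps, feeding each output to the next step."""

    def __init__(self, steps):
        self.steps = list(steps)

    def run(self, data):
        for _name, step in self.steps:
            data = step(data)
        return data

KEPT_FEATURES = {"login_count", "follow_count"}  # hypothetical selected features

def select(raw):
    # Feature selection: drop everything that is not a kept feature.
    return {k: v for k, v in raw.items() if k in KEPT_FEATURES}

def encode(selected):
    # Encoding: turn the selected features into a vector with a stable order.
    return [float(selected[k]) for k in sorted(selected)]

def predict(vector):
    # Stand-in for the trained classification model.
    return "malicious" if sum(vector) > 5.0 else "normal"

pipeline = Pipeline([("select", select), ("encode", encode), ("predict", predict)])
result = pipeline.run({"login_count": 4, "follow_count": 3, "nickname_length": 8})
print(result)
```

The business party only needs to supply the raw data dictionary; the intermediate representations stay inside the pipeline.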
Fig. 4 is a block diagram illustrating an account identification apparatus according to an example embodiment. Referring to fig. 4, the apparatus includes an obtaining unit 401, a determining unit 402, a feature fusion unit 403, and an identification unit 404.
An obtaining unit 401 configured to perform obtaining a first account characteristic of an account to be identified;
a determining unit 402 configured to perform determining, based on the first account characteristic, a first category of the account and a category characteristic of the account, where the category characteristic of the account is used to represent a relationship between the account and the first category;
a feature fusion unit 403, configured to perform feature fusion on the category features and the first account features to obtain a second account feature of the account;
the identification unit 404 is configured to perform inputting of the second account characteristics of the account into a target classification model, and obtain an identification result of the account by predicting whether the account is a target type account or not through the target classification model.
In a possible implementation manner, the obtaining unit is configured to obtain a user profile characteristic, a login characteristic, and a user behavior characteristic of an account to be identified; and to delete the features whose weight is smaller than the target weight and acquire the remaining features as the first account features of the account.
In a possible implementation manner, the obtaining unit is configured to perform splicing of the user profile characteristic, the login characteristic, and the user behavior characteristic of the account to obtain the first account characteristic of the account.
In a possible implementation manner, the obtaining unit is configured to encode each feature respectively to obtain encoded features, and splice the encoded features to obtain the first account feature of the account.
In a possible implementation manner, the obtaining unit is configured to, when encoding a target feature, segment the target feature to obtain multiple segments of sub-features of the target feature, encode the multiple segments of sub-features respectively, and splice encoding results to obtain the encoded target feature.
In a possible implementation manner, the determining unit is configured to perform inputting the first account feature into a clustering model, and obtain the first category of the account and the category feature of the account according to a distance relationship between the first account feature and a plurality of clusters through the clustering model.
In one possible implementation manner, the determining unit is configured to perform parallel computation through a GPU when determining the category based on the first account characteristic.
In one possible implementation, the apparatus further includes: a model training unit configured to perform:
acquiring first sample account characteristics of a plurality of sample accounts; determining a plurality of categories of the sample account numbers and category characteristics of the sample account numbers based on the plurality of first sample account number characteristics, wherein the category characteristics of the sample account numbers are used for representing the relationship between the sample account numbers and the categories; respectively carrying out feature fusion on the category features of the sample accounts and the first sample account features and the label information of the sample accounts to obtain second sample account features of the sample accounts; and training by adopting the second sample account characteristics of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the apparatus further includes:
a feature processing unit configured to perform calculating weights of input sample features of the plurality of sample accounts through a tree model; and deleting the features with the weight smaller than the target weight, and acquiring the rest features as the first sample account features of the plurality of sample accounts.
In one possible implementation manner, the category characteristics of the account include: a distance between the first account number feature and a cluster center of the first category.
FIG. 5 is a block diagram illustrating a server in accordance with an example embodiment. The server 500 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 501 and one or more memories 502, where at least one instruction is stored in the memory 502 and is loaded and executed by the processor 501 to implement the account identification method provided by each of the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may also include other components for implementing the functions of the device, which are not described herein again.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An account identification method is characterized by comprising the following steps:
acquiring a first account characteristic of an account to be identified;
determining a first category of the account and a category characteristic of the account based on the first account characteristic, wherein the category characteristic of the account is used for representing a relationship between the account and the first category;
performing feature fusion on the category features of the account and the first account features to obtain second account features of the account;
inputting the second account characteristics of the account into a target classification model, and predicting whether the account is of a target type through the target classification model to obtain the identification result of the account.
2. The account identification method according to claim 1, wherein the acquiring the first account characteristic of the account to be identified comprises:
and splicing the user profile characteristics, the login characteristics and the user behavior characteristics of the account to obtain the first account characteristics of the account.
3. The account identification method according to claim 2, wherein the obtaining the first account characteristic of the account by splicing the user profile characteristic, the login characteristic and the user behavior characteristic of the account comprises:
and respectively coding each characteristic to obtain coded characteristics, and splicing the coded characteristics to obtain the first account characteristic of the account.
4. The account identification method according to claim 3, further comprising:
when the target features are coded, the target features are segmented to obtain multiple segments of sub-features of the target features, the multiple segments of sub-features are coded respectively, and coding results are spliced to obtain the coded target features.
5. The account identification method of claim 1, wherein the determining the first category of the account and the category characteristic of the account based on the first account characteristic comprises:
inputting the first account characteristics into a clustering model, and obtaining a first category of the account and category characteristics of the account according to the distance relationship between the first account characteristics and a plurality of clusters through the clustering model.
6. The account identification method according to claim 1, wherein the method further comprises: performing parallel computation through a GPU when the category is determined based on the first account characteristics.
7. The account identification method of claim 1, wherein before entering the second account characteristic of the account into the target classification model, the method further comprises:
acquiring first sample account characteristics of a plurality of sample accounts;
determining a plurality of categories of the sample account numbers and category characteristics of the sample account numbers based on the plurality of first sample account number characteristics, wherein the category characteristics of the sample account numbers are used for representing the relationship between the sample account numbers and the categories;
respectively carrying out feature fusion on the category features of the sample accounts and the first sample account features and the label information of the sample accounts to obtain second sample account features of the sample accounts;
and training by adopting the second sample account characteristics of the plurality of sample accounts to obtain the target classification model.
8. An account identification device, comprising:
an acquisition unit configured to acquire a first account characteristic of an account to be identified;
a determining unit configured to perform determining a first category of the account and a category feature of the account based on the first account feature, wherein the category feature of the account is used for representing a relationship between the account and the first category;
the feature fusion unit is configured to perform feature fusion on the category features and the first account features to obtain second account features of the account;
and the identification unit is configured to input the second account characteristics of the account into a target classification model, and predict whether the account is a target type account through the target classification model to obtain the identification result of the account.
9. A server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the account identification method of any of claims 1 to 7.
10. A storage medium in which instructions, when executed by a processor of a server, enable the server to perform the account identification method of any one of claims 1 to 7.
CN201911136455.5A 2019-11-19 2019-11-19 Account identification method, device, server and storage medium Active CN112905987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911136455.5A CN112905987B (en) 2019-11-19 2019-11-19 Account identification method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911136455.5A CN112905987B (en) 2019-11-19 2019-11-19 Account identification method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN112905987A (en) 2021-06-04
CN112905987B (en) 2024-02-27

Family

ID=76104647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911136455.5A Active CN112905987B (en) 2019-11-19 2019-11-19 Account identification method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN112905987B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407800A (en) * 2023-09-11 2024-01-16 北京工商大学 Social media robot detection method and system based on random forest and XGBoost model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503562A (en) * 2015-09-06 2017-03-15 阿里巴巴集团控股有限公司 A kind of Risk Identification Method and device
CN108418825A (en) * 2018-03-16 2018-08-17 阿里巴巴集团控股有限公司 Risk model training, rubbish account detection method, device and equipment
CN109525595A (en) * 2018-12-25 2019-03-26 广州华多网络科技有限公司 A kind of black production account recognition methods and equipment based on time flow feature
CN110119860A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 A kind of rubbish account detection method, device and equipment
CN110198310A (en) * 2019-05-20 2019-09-03 腾讯科技(深圳)有限公司 A kind of anti-cheat method of network behavior, device and storage medium
CN110225036A (en) * 2019-06-12 2019-09-10 北京奇艺世纪科技有限公司 A kind of account detection method, device, server and storage medium
CN110232630A (en) * 2019-05-29 2019-09-13 腾讯科技(深圳)有限公司 The recognition methods of malice account, device and storage medium
CN110399925A (en) * 2019-07-26 2019-11-01 腾讯科技(武汉)有限公司 Risk Identification Method, device and the storage medium of account

Also Published As

Publication number Publication date
CN112905987B (en) 2024-02-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant