CN112905987B - Account identification method, device, server and storage medium - Google Patents

Account identification method, device, server and storage medium

Info

Publication number
CN112905987B
Authority
CN
China
Prior art keywords
account
sample
feature
category
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911136455.5A
Other languages
Chinese (zh)
Other versions
CN112905987A (en
Inventor
郗剑亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201911136455.5A
Publication of CN112905987A
Application granted
Publication of CN112905987B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/45 Structures or tools for the administration of authentication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to an account identification method, an account identification device, a server and a storage medium. In the method and the device, the account is first assigned to a category so that some of its overall characteristics are determined. The features of the account are then combined with the category features generated during that step and with features of higher importance, yielding features that represent the account more comprehensively. Classification is performed on these features, so that the account is identified using learned characteristics of malicious accounts, which can greatly improve the accuracy of identifying malicious accounts. The process combines supervised and unsupervised approaches in a cascade: the unsupervised stage is used to obtain the category-level features of the account, and the supervised stage then classifies the categories obtained by the unsupervised stage again to achieve an accurate division.

Description

Account identification method, device, server and storage medium
Technical Field
The disclosure relates to the field of network technologies, and in particular, to an account identification method, an account identification device, a server and a storage medium.
Background
In many internet application scenarios, such as e-commerce, virtual social networking, financial services and video websites, some people maliciously register large numbers of accounts using false information in order to carry out illegal activities such as malicious order brushing and fraud. Such accounts therefore need to be identified to protect the interests of users, merchants and operators.
At present, account identification is generally implemented by setting identification rules: an account whose account information does not conform to the rules is identified as a maliciously registered account. Although this approach is easy to configure, it is also very easy to bypass, which results in low identification accuracy.
Disclosure of Invention
The disclosure provides an account identification method, an account identification device, a server and a storage medium, so as to at least solve the problem of low identification accuracy in the related art. The technical scheme of the present disclosure is as follows:
in a first aspect, an account identification method is provided, including:
acquiring a first account characteristic of an account to be identified; determining a first category of the account and category characteristics of the account based on the first account characteristics, wherein the category characteristics of the account are used for representing a relationship between the account and the first category; carrying out feature fusion on the category features and the first account features to obtain second account features of the account; and inputting the second account characteristics of the account into a target classification model, and predicting whether the account is a target type account or not through the target classification model to obtain an identification result of the account.
In one possible implementation manner, the acquiring the first account feature of the account to be identified includes: splicing the user profile feature, the login feature and the user behavior feature of the account to obtain the first account feature of the account.
In one possible implementation manner, the splicing the user profile feature, the login feature and the user behavior feature of the account to obtain the first account feature of the account includes: encoding each feature separately to obtain encoded features, and splicing the encoded features to obtain the first account feature of the account.
In one possible implementation, the method further includes: when a target feature is encoded, segmenting the target feature to obtain multiple sub-features of the target feature, encoding the sub-features separately, and splicing the encoding results to obtain the encoded target feature.
In one possible implementation manner, the determining the first category of the account and the category feature of the account based on the first account feature includes: inputting the first account feature into a clustering model, and obtaining, through the clustering model, the first category of the account and the category feature of the account according to the distance relationship between the first account feature and a plurality of clusters.
In one possible implementation, the method further includes: performing parallel computation through a GPU when the category is determined based on the first account feature.
In one possible implementation manner, before the second account feature of the account is input into the target classification model, the method further includes: acquiring first sample account features of a plurality of sample accounts; determining, based on the plurality of first sample account features, categories of the plurality of sample accounts and category features of the plurality of sample accounts, wherein the category features of a sample account are used for representing the relationship between the sample account and its category; performing feature fusion on the category features of the plurality of sample accounts, the plurality of first sample account features and the label information of the plurality of sample accounts, respectively, to obtain second sample account features of the plurality of sample accounts; and training with the second sample account features of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the method further includes: calculating weights of the input sample features of the plurality of sample accounts through a tree model, deleting the features whose weight is smaller than a target weight, and taking the remaining features as the first sample account features of the plurality of sample accounts.
In a second aspect, an account identification device is provided, including:
the acquisition unit is configured to acquire first account characteristics of an account to be identified;
a determining unit configured to perform determining a first category of the account and a category characteristic of the account based on the first account characteristic, the category characteristic of the account being used to represent a relationship between the account and the first category;
the feature fusion unit is configured to perform feature fusion on the category features and the first account features to obtain second account features of the account;
the identification unit is configured to input the second account characteristics of the account into a target classification model, and predict whether the account is a target type account or not through the target classification model to obtain an identification result of the account.
In one possible implementation manner, the obtaining unit is configured to perform stitching of the user profile feature, the login feature and the user behavior feature of the account, so as to obtain a first account feature of the account.
In one possible implementation manner, the obtaining unit is configured to encode each feature respectively to obtain an encoded feature, and splice the encoded features to obtain a first account feature of the account.
In one possible implementation manner, the obtaining unit is configured to segment the target feature when encoding the target feature, obtain multiple segments of sub-features of the target feature, encode the multiple segments of sub-features respectively, and splice encoding results to obtain the encoded target feature.
In one possible implementation manner, the determining unit is configured to perform inputting the first account feature into a clustering model, and obtain, through the clustering model, a first category of the account and a category feature of the account according to a distance relationship between the first account feature and a plurality of clusters.
In one possible implementation manner, the determining unit performs parallel computation through the GPU when determining the category based on the first account feature.
In one possible implementation, the apparatus further includes: a model training unit configured to perform:
acquiring first sample account characteristics of a plurality of sample accounts; determining a category of the plurality of sample accounts and a category characteristic of the plurality of sample accounts based on a first target sample characteristic in the plurality of first sample account characteristics, wherein the category characteristic of the sample account is used for representing a relation between the sample account and the category; respectively carrying out feature fusion on the category features of the plurality of sample accounts and second target sample features in the plurality of first sample account features and the label information of the plurality of sample accounts to obtain second sample account features of the plurality of sample accounts, wherein the weight of the second target sample features is greater than or equal to the target weight; training by adopting second sample account characteristics of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the apparatus further includes:
a feature processing unit configured to perform a calculation of weights of input sample features of the plurality of sample accounts by a tree model; deleting the characteristics with the weight smaller than the target weight, and acquiring the remaining characteristics as the first sample account characteristics of the plurality of sample accounts.
According to a third aspect of embodiments of the present disclosure, there is provided a server comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the account identification method according to any one of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions in the storage medium, when executed by a processor of a server, enable the server to perform the account identification method according to any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising executable instructions which, when executed by a processor of a server, enable the server to perform the account identification method according to any one of the above.
The technical scheme provided by the embodiments of the disclosure brings at least the following beneficial effects: the account is first assigned to a category so that some of its overall characteristics are determined; the features of the account are then combined with the category features generated during that step and with features of higher importance, yielding features that represent the account comprehensively. Classification is performed on these features, so that the account is identified using learned characteristics of malicious accounts, which can greatly improve the accuracy and recall of malicious account identification. The process combines supervised and unsupervised approaches in a cascade: the unsupervised stage is used to obtain the category-level features of the account, and the supervised stage then classifies the categories obtained by the unsupervised stage again to achieve an accurate division.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a flow chart illustrating a method of account identification according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of account identification according to an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a number of different technical processes involved in an account identification process according to an example embodiment.
Fig. 4 is a block diagram illustrating an account identification device according to an exemplary embodiment.
Fig. 5 is a block diagram of a server, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The user information referred to in the present disclosure may be information authorized by the user or sufficiently authorized by each party.
Fig. 1 is a flowchart illustrating an account identification method according to an exemplary embodiment. As shown in Fig. 1, the method is used in a server and includes the following steps.
In step 101, a first account feature of an account to be identified is obtained.
In step 102, a first category of the account and a category characteristic of the account are determined based on the first account characteristic, the category characteristic of the account being used to represent a relationship between the account and the first category.
In step 103, feature fusion is performed on the category features of the account and the first account features, so as to obtain second account features of the account.
In step 104, the second account feature of the account is input into a target classification model, and whether the account is a target type account is predicted through the target classification model, so as to obtain the recognition result of the account.
According to the method provided by the embodiment of the disclosure, the account is first assigned to a category so that some of its overall characteristics are determined. The category features generated during that step are combined with some features of higher importance, so that the account can be represented comprehensively, and classification is performed on these features. The account is thus identified using learned characteristics of malicious accounts, which can greatly improve the accuracy of malicious account identification. The process combines supervised and unsupervised approaches in a cascade: the unsupervised stage is used to obtain the category-level features of the account, and the supervised stage then classifies the categories obtained by the unsupervised stage again to achieve an accurate division.
In one possible implementation manner, the acquiring the first account feature of the account to be identified includes: splicing the user profile feature, the login feature and the user behavior feature of the account to obtain the first account feature of the account.
In one possible implementation manner, the splicing the user profile feature, the login feature and the user behavior feature of the account to obtain the first account feature of the account includes: encoding each feature separately to obtain encoded features, and splicing the encoded features to obtain the first account feature of the account.
In one possible implementation, the method further includes:
when a target feature is encoded, segmenting the target feature to obtain multiple sub-features of the target feature, encoding the sub-features separately, and splicing the encoding results to obtain the encoded target feature.
In one possible implementation manner, the determining, based on the first account feature, the first category of the account and the category feature of the account includes:
inputting the first account feature into a clustering model, and obtaining, through the clustering model, the first category of the account and the category feature of the account according to the distance relationship between the first account feature and a plurality of clusters.
In one possible implementation, the method further includes: performing parallel computation through a GPU when the category is determined based on the first account feature.
In one possible implementation manner, before the inputting the second account feature of the account into the target classification model, the method further includes:
acquiring first sample account features of a plurality of sample accounts;
determining categories of the plurality of sample accounts and category features of the plurality of sample accounts based on the plurality of first sample account features, wherein the category features of a sample account are used for representing the relationship between the sample account and its category;
performing feature fusion on the category features of the plurality of sample accounts, the plurality of first sample account features and the label information of the plurality of sample accounts, respectively, to obtain second sample account features of the plurality of sample accounts;
training with the second sample account features of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the method further includes: calculating weights of the input sample features of the plurality of sample accounts through a tree model, deleting the features whose weight is smaller than a target weight, and splicing the remaining features into the first sample account features of the plurality of sample accounts.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
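The weight-based feature selection described above can be illustrated with a minimal sketch. It assumes that the weights have already been produced by some tree model; the feature names, the weight values and the function name `select_features` are hypothetical and are not taken from the disclosure.

```python
def select_features(weights, target_weight):
    """Keep the names of features whose weight is at least `target_weight`,
    preserving their original order."""
    return [name for name, w in weights.items() if w >= target_weight]


# Hypothetical weights, standing in for importances reported by a tree model.
weights = {"login_ip": 0.41, "device_model": 0.27, "gender": 0.02, "region": 0.30}
print(select_features(weights, target_weight=0.05))
# ['login_ip', 'device_model', 'region']
```

Features below the target weight are dropped, and the remaining ones would then be encoded and spliced into the first sample account features.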
Fig. 2 is a flowchart illustrating an account identification method according to an exemplary embodiment. As shown in Fig. 2, the method is used in a server and includes the following steps.
In step 201, the server obtains user profile features, login features, and user behavior features of an account to be identified.
When identifying an account, the server may obtain basic data of the account to be identified. The basic data may include user profile features, which may be obtained from a user profile database and may be the information filled in at user registration or an updated version of that information, for example the user's gender, age and region; the embodiment of the disclosure is not limited in this respect. The basic data may further include some front-end and back-end features of the user, for example the login device model, the login system version and the login IP address used by the user; such data may be collectively referred to as login features. In addition, the basic data may include user behavior features, which indicate actions such as clicking, watching and following performed by the user while logged in. Together, these features cover almost all characteristics of the account and allow a deeper description of the user.
In one possible implementation manner, the server may further obtain a user portrait of the account, where the user portrait may be generated based on user data features and historical behavior features of the account, and by combining the user portrait and the features, feature information covering the whole account may be obtained, so that information omission is avoided.
In step 202, the server encodes each feature to obtain encoded features.
Each of the features acquired above may be encoded so that the features are represented in a unified form. Different classes of features may be encoded in a manner corresponding to the feature. For example, gender may be encoded as a fixed-length vector, with one gender value encoded as (0, 0) and the other as (0, 1). Of course, the vector representations of different classes of features need not have equal lengths, which is not limited in the embodiments of the present disclosure.
Some of the above features are continuous and some are discrete. For discrete features, the encoding may be simplified to reduce the amount of computation and improve computational efficiency. For example, a target feature may be fold-encoded: the target feature is segmented to obtain multiple sub-features, the sub-features are encoded separately, and the encoding results are spliced to obtain the encoded target feature.
Taking the login IP as the target feature, the login IP address can be divided into 4 segments, each of which is encoded independently. The number of encoding bits is thereby shortened from 255^4 to 255 × 4, a compression ratio of about 4 million. This encoding does not introduce much error into subsequent model training, and it avoids the excessively long codes that result from applying dummy encoding directly to the whole address. The computational efficiency of the overall model is not affected and memory usage is greatly reduced, which achieves the aims of reducing the amount of computation and improving computational efficiency.
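The segmented encoding of a login IP can be sketched as follows. This is a minimal illustration rather than the implementation of the disclosure: the function names are hypothetical, and each segment is given 256 positions (values 0 to 255) as an assumption, whereas the text above counts 255 per segment.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_ip_segmented(ip, segment_size=256):
    """Split a dotted IPv4 address into 4 segments, one-hot encode each
    segment separately, and concatenate the four encodings."""
    segments = [int(part) for part in ip.split(".")]
    if len(segments) != 4 or any(not 0 <= s < segment_size for s in segments):
        raise ValueError(f"not a valid IPv4 address: {ip!r}")
    encoded = []
    for seg in segments:
        encoded.extend(one_hot(seg, segment_size))
    return encoded


vec = encode_ip_segmented("192.168.0.1")
print(len(vec))  # 1024
print(sum(vec))  # 4
```

Under this assumption the encoded vector has 4 × 256 = 1024 positions with exactly four set, instead of one position per possible address.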
In some possible implementations, the above encoding may use one-hot encoding, which is suitable when the feature dimension is not too high and is simple enough to greatly reduce encoding time.
In step 203, the server splices the encoded features to obtain a first account feature of the account.
To represent the features of an account, the encoded features may be spliced together to serve as a description of the account. Based on the vector representation above, the feature vectors obtained by encoding the individual features may be spliced in a predetermined order.
It should be noted that some accounts may lack certain features. For such accounts, the missing features may be supplemented: for example, each missing feature is mapped to a preset vector corresponding to that feature, and the preset vector is used as the feature vector of the missing feature in the splicing step.
For example, for categorical features (such as user gender and mobile phone system version) and numerical features (such as user registration timestamp and usage duration), encoding and splicing may yield the following feature vector: (1000,[2,3,9,100,999],[3.0,1.0,0.0,101.25,0.1]).
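The splicing step, including the substitution of preset vectors for missing features, might look like the sketch below. The feature names, the splicing order and the preset vectors are hypothetical choices made for illustration, and dense lists are used instead of the sparse notation of the example above.

```python
# Hypothetical splicing order and preset vectors for missing features.
FEATURE_ORDER = ["region", "os_version", "usage_hours"]
PRESET_VECTORS = {
    "region": [0, 0],
    "os_version": [0, 0, 0],
    "usage_hours": [0.0],
}


def splice_features(encoded):
    """Concatenate encoded feature vectors in a fixed order, substituting the
    preset vector for any feature the account lacks."""
    first_account_feature = []
    for name in FEATURE_ORDER:
        first_account_feature.extend(encoded.get(name, PRESET_VECTORS[name]))
    return first_account_feature


account = {"region": [0, 1], "usage_hours": [101.25]}  # os_version is missing
print(splice_features(account))  # [0, 1, 0, 0, 0, 101.25]
```

Because the order and the preset vectors are fixed, every account yields a vector of the same length regardless of which features are present.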
In step 204, the server inputs the first account feature into a clustering model, and obtains a first category of the account and category features of the account according to a distance relationship between the first account feature and a plurality of clusters through the clustering model.
When clustering, the clustering model may assign the first account feature to the cluster whose center is closest to it, based on the distance between the first account feature and the center of each cluster, and output the serial number of that cluster together with the distance between the first account feature and the cluster center, which represent the relationship between the account and the category to which it belongs.
Step 204 is the process of determining the first category of the account and the category features of the account based on the first account feature. The embodiment of the disclosure may implement the clustering model with the KMeans++ algorithm, which is simple to implement and converges quickly. Of course, the clustering model may also be built on any clustering algorithm such as KMeans, DBSCAN or GMM, chosen flexibly according to the requirements of the service scenario and the characteristics of the algorithm, which is not limited in the embodiment of the disclosure. In some possible implementations, the server may also use a GPU for parallel computation during clustering, so that data output is completed on time in offline scenarios.
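The assignment part of step 204 can be sketched as follows, assuming that the cluster centers have already been fitted beforehand (for example by KMeans++). Only the nearest-center lookup is shown; the function name, the use of Euclidean distance and the example centers are assumptions made for illustration.

```python
import math


def assign_cluster(feature, centers):
    """Return (cluster_index, distance) for the cluster center nearest to
    `feature`, using Euclidean distance."""
    distances = [math.dist(feature, center) for center in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


# Hypothetical, already-fitted cluster centers.
centers = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
cluster, distance = assign_cluster([9.0, 10.0], centers)
print(cluster, distance)  # 1 1.0
```

The returned pair corresponds to the serial number of the cluster and the distance to its center, which together form the category feature of the account.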
In step 205, the server performs feature fusion on the category feature of the account and the first account feature to obtain a second account feature of the account.
The category feature obtained by clustering may be encoded before feature fusion, and the encoded category feature is then used for fusion. For example, if the encoded category feature is [2001,105,100], then splicing the first account feature (1000,[2,3,9,100,999],[3.0,1.0,0.0,101.25,0.1]) from the above example with the feature vector of the category feature yields the following feature:
(1000,[2,3,9,100,999],[3.0,1.0,0.0,101.25,0.1],[2001,105,100]).
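Feature fusion by splicing can be sketched as below. The disclosure only states that the category feature may be encoded before fusion; encoding it as a one-hot cluster indicator followed by the distance to the cluster center is an assumption of this sketch, as are the function name and the example values.

```python
def fuse_features(first_account_feature, cluster_index, distance, num_clusters):
    """Append a one-hot cluster indicator and the distance to the cluster
    center to the first account feature."""
    category_feature = [0.0] * num_clusters
    category_feature[cluster_index] = 1.0
    category_feature.append(distance)
    return list(first_account_feature) + category_feature


second = fuse_features([0, 1, 101.25], cluster_index=1, distance=1.0, num_clusters=3)
print(second)  # [0, 1, 101.25, 0.0, 1.0, 0.0, 1.0]
```

The resulting second account feature is what would be passed to the target classification model in step 206.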
In step 206, the server inputs the second account feature of the account into a target classification model, and predicts whether the account is a target type account through the target classification model, so as to obtain the identification result of the account.
The target classification model can be a model which has learned the characteristics of the malicious account, the training data adopted by the target classification model can comprise the characteristics of the same reason, and tag information is added to realize supervised learning, so that the aim of identifying whether the account is the malicious account is fulfilled. Embodiments of the present disclosure will be described in detail below with respect to the process of how a model is specifically trained.
The target classification model may be used to predict whether the account is a target-type account. The prediction result may be a binary classification result, that is, a yes-or-no output; a multi-class result may of course also be used to improve the scene applicability of the model, so that it can be applied to a variety of recognition scenarios. The classification algorithm used by the classification model may likewise be chosen from various options, for example RF, GBDT, LR, or NN, in combination with the service scenario; the embodiments of the disclosure do not limit which algorithm is adopted.
According to the method provided by the embodiments of the disclosure, accounts are first clustered to determine their overall features; the category features generated during clustering are then combined with features of higher importance, so that the account is represented comprehensively, and classification is performed on these combined features. The account is thereby identified using the learned features of malicious accounts, which can greatly improve the accuracy of identifying malicious accounts. This process combines supervised and unsupervised learning in a cascade: the unsupervised stage yields the category-level features of the account, and the supervised stage then classifies the categories obtained by the unsupervised stage again, to achieve an accurate division. Further, when discrete high-dimensional features are encoded, an encoding that shortens the number of feature-encoding bits is used, so that more effective information is retained in as few dimensions as possible; this greatly reduces the amount of computation, saves computing resources, and allows a response to be given more quickly. Further, during feature selection, effective features are selected for clustering based on feature importance, which avoids interference from ineffective features in the identification process.
Further, different identification results may trigger the server to perform corresponding verification or penalties afterwards. For example, if the identification result falls in a first target value interval, the accuracy of the result may be low, and an administrator may be triggered to verify it manually; if the identification result falls in a second target value interval, the account is determined to be a malicious account, and the server may be triggered to automatically ban the account or the like, which greatly improves the efficiency of handling malicious accounts.
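The interval-based follow-up can be sketched as below. The interval bounds, the function name `route_account`, and the returned action names are hypothetical values chosen for illustration; the disclosure does not specify them.

```python
def route_account(score, review_interval=(0.5, 0.8), ban_interval=(0.8, 1.0)):
    """Map an identification score to a follow-up action (illustrative bounds)."""
    if ban_interval[0] <= score <= ban_interval[1]:
        return "auto_ban"        # second target value interval: treated as malicious
    if review_interval[0] <= score < review_interval[1]:
        return "manual_review"   # first target value interval: result may be unreliable
    return "no_action"
```

For instance, under these assumed bounds a score of 0.6 would be sent to an administrator, while a score of 0.9 would trigger automatic banning.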
Online data observation shows that, after the embodiments of the present disclosure are applied, evaluation metrics such as accuracy and recall increase markedly: calculated as a daily average, the recognition accuracy is above 96.3% and the recall is above 97%.
In one possible implementation, the training process of the target classification model is broadly divided into basic data processing, feature selection, data aggregation, cluster division, and so on. In the basic data processing stage, data of the sample accounts may be collected, for example the basic features of the accounts, user behavior features, and user portraits. This may be implemented by interacting with a database, which may use an architecture such as HDFS (Hadoop Distributed File System) or HIVE (a Hadoop-based data warehouse tool); the embodiments of the disclosure do not limit this. In the feature selection stage, processing such as feature extraction, encoding, deletion processing, discretization, and dimensionless processing may be performed. In the data aggregation stage, the processed features may be divided into clusters by means such as clustering; the clustering algorithm may specifically be KMeans, and the feature weights may also be applied during cluster division. Finally, the result may be applied to actual account identification, for example by providing a table of abnormal accounts. The specific training process is described below:
Step one, acquiring first sample account characteristics of a plurality of sample accounts.
Before training the target classification model, the server may obtain basic data of the sample accounts. The basic data may include user data features, which may be obtained from a user data database and may be information filled in at user registration or information obtained after the user updates their data, for example the user's gender, age, and region; the embodiments of the disclosure do not limit this. The basic data may further include front-end and back-end features of the user, for example the login device model, login system version, and login IP address used by the user; such data may be collectively referred to as login features. In addition, the basic data may include user behavior features, which indicate operations such as clicking, watching, and focusing performed by the user during the login. Together these features cover almost all aspects of the account and allow a deeper description of the user.
In one possible implementation manner, the server may further obtain a user portrait of the sample account, where the user portrait may be generated based on user data features and historical behavior features of the account, and by combining the user portrait and the features, feature information covering the whole account may be obtained, so that information omission is avoided.
It should be noted that the sample account includes a positive sample and a negative sample, the positive sample is a sample account marked as a non-malicious account, and the negative sample is a sample account marked as a malicious account.
After a large number of features are selected, some of them have no discriminative power, and using them directly introduces noise into the model; features whose weight is less than the target weight may therefore be deleted. The weight of each feature may be a preset weight or a weight calculated by a tree model based on the features; the embodiments of the present disclosure do not limit this. In one possible implementation, the method further includes: calculating the weights of the input sample features of the plurality of sample accounts through the tree model, deleting the features whose weight is less than the target weight, and splicing the remaining features into the first sample account features of the plurality of sample accounts. The target weight may differ between training processes or recognition processes and may be determined from the actual samples used in training; for example, the features whose weights rank within a preset number of top positions may be selected, and the minimum of the weights of the selected features is used as the target weight. The weight calculation may also be implemented with XGBoost, in which case the weight may refer to the feature importance produced during the XGBoost model calculation. Because the weight represents the importance of a feature to the subsequent recognition, deleting features of lower importance avoids their interference with recognition and thereby improves recognition accuracy.
In one possible implementation, a larger weight indicates higher importance; a weight smaller than the target weight indicates that the feature is of lower importance, and deleting such features reduces the amount of computation in the subsequent encoding and calculation, thereby saving resources.
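The weight-based deletion described above can be sketched as follows, assuming the feature weights have already been produced (for example by a tree model) and are supplied as a dictionary. The function name `select_features`, the parameter `top_n`, and the feature names in the usage lines are assumptions for illustration.

```python
def select_features(weights, top_n):
    """Keep the features whose weight is at least the target weight.

    weights: dict mapping feature name -> importance weight (non-empty).
    top_n:   preset number of top-ranked positions (at least 1); the smallest
             weight among those positions becomes the target weight.
    """
    ranked = sorted(weights.values(), reverse=True)
    target_weight = ranked[min(top_n, len(ranked)) - 1]
    kept = [name for name, w in weights.items() if w >= target_weight]
    return target_weight, kept


weights = {"device_model": 0.40, "login_ip": 0.30, "age": 0.05, "region": 0.25}
target_weight, kept = select_features(weights, top_n=3)
# target_weight is 0.25; "age" falls below it and is deleted.
```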
When the features are spliced, the splicing process is the same as the splicing process in the account identification process, and reference may be made to step 202, which is not described herein.
Step two, determining the categories of the plurality of sample accounts and the category features of the plurality of sample accounts based on the plurality of first sample account features, wherein the category feature of a sample account is used to represent the relation between the sample account and its category.
Step two may be a clustering process, implemented for example by a clustering model: the server may input the plurality of first sample account features into the clustering model, so as to cluster them with a clustering algorithm. A specific clustering procedure may be, for example: (1) according to a given number, selecting that number of first sample account features as the initial cluster centers; (2) calculating the distance from every first sample account feature to each cluster center, and assigning each first sample account feature to its nearest cluster center; (3) calculating the mean of the first sample account features in each cluster and taking the mean as the new cluster center; (4) repeating steps (2) and (3) until the maximum number of iterations is reached or the change in the cluster centers is smaller than a predefined threshold, at which point the iteration ends and the clustering result is obtained. After clustering is complete, its products may be output, such as the plurality of clusters, the first sample account features in each cluster, and the compactness of the plurality of clusters.
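The iterative procedure just described can be sketched in plain Python as below. Using the first k points as the initial centers, the default values of `max_iter` and `tol`, and keeping the old center when a cluster becomes empty are all assumptions of this sketch, not details given in the disclosure.

```python
import math


def kmeans(points, k, max_iter=100, tol=1e-6):
    """Cluster points by repeated assignment and mean update (sketch)."""
    # Initial division: use the first k points as cluster centers (assumption).
    centers = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign every point to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(center)  # empty cluster keeps its old center
        # Stop once the largest center movement is below the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters
```

The returned `clusters` list holds the members of each cluster, from which further clustering products such as per-cluster compactness could be derived.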
It should be noted that, before each iteration of the clustering process in the embodiments of the disclosure, some features whose weights are smaller than the target weight are randomly deleted; after multiple iterations, the group of features with the smallest within-set sum of squared errors (Within Set Sum of Squared Errors, WSSSE) is selected as the features used in the clustering process. That is, besides deleting the features whose weight is less than the target weight, other features may also be deleted, and the finally determined features are obtained through this iterative process. Of course, in the account identification process described above, only these finally determined features need to be obtained when acquiring the account features, which avoids wasting data and computing resources.
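A minimal sketch of the WSSSE criterion is given below: it sums the squared distance from each point to its nearest cluster center and picks the feature group with the smallest sum. The function names and the `results` structure (feature group mapped to the clustered points and centers for that group) are assumptions for illustration.

```python
import math


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def pick_feature_group(results):
    """Return the feature group whose clustering has the smallest WSSSE.

    results: dict mapping a feature group (tuple of feature names) to a
             (points, centers) pair produced by clustering on that group.
    """
    return min(results, key=lambda group: wssse(*results[group]))
```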
Step three, performing feature fusion on the category features of the plurality of sample accounts, the plurality of first sample account features, and the label information of the plurality of sample accounts, respectively, to obtain the second sample account features of the plurality of sample accounts.
The label information indicates whether a sample account is a malicious account. The feature fusion is the same as in step 205 and is not repeated here. It should be noted that the feature fusion may further include feature crossing: for example, when a sample has multiple category features, an AND or OR combination of the multiple category features may be obtained as the comprehensive category feature of the sample account, to improve the robustness of the model.
Step four, training with the second sample account features of the plurality of sample accounts to obtain the target classification model.
During training, in each iteration the second sample account features may be input into the model, and the model computes on them with its current parameters to output an identification result. The model parameters are adjusted based on the difference between the identification result and the label information, and the next iteration is performed with the adjusted parameters, until an iteration stop condition is met, for example the identification accuracy reaching a target accuracy. The model parameters at that point are output as the parameters of the classification model, yielding the target classification model.
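The iterate, compare, and adjust loop can be sketched with logistic regression trained by gradient descent. Logistic regression is only one of the classifier options the disclosure names (LR); the learning rate, the default target accuracy, the function names, and the convention that label 1 means malicious are assumptions of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(samples, labels, target_accuracy=0.95, lr=0.5, max_iter=1000):
    """Adjust parameters until the target accuracy is reached (sketch)."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(max_iter):
        # Identification result under the current model parameters.
        probs = [sigmoid(bias + sum(w * v for w, v in zip(weights, x))) for x in samples]
        accuracy = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)
        if accuracy >= target_accuracy:
            break  # iteration stop condition met
        # Adjust parameters from the difference between result and label.
        for x, p, y in zip(samples, probs, labels):
            error = p - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias
```

In practice the stop condition would normally be evaluated on held-out data rather than on the training samples themselves; the sketch checks the training set only to stay short.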
In the training process, the features used for training may be determined through multiple rounds of clustering iteration; the clustering results are then combined with some high-weight features, and the model is trained iteratively with the label information added, so that the target classification model can identify malicious accounts.
In one possible implementation, the processes of feature selection, encoding, and model prediction may be packaged into a pipeline, and the code package of the whole account identification method is provided to business parties as a black box. A business party provides the basic data, and model parameters meeting its business requirements are obtained through feature extraction and training on that data. In other words, the embodiments of the disclosure can provide a middle-platform service: business parties that access it only need to provide basic data according to the rules, which fits the design pattern of a small front end and a large middle platform and can efficiently support multiple business lines.
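One way to picture such a pipeline is a small wrapper that runs named steps in order, so that a caller only supplies basic data and receives results. The `Pipeline` class and the three stand-in steps below are hypothetical and greatly simplified; they are not the packaged code of the disclosure.

```python
class Pipeline:
    """Run a fixed sequence of named processing steps as one black box."""

    def __init__(self, steps):
        self.steps = list(steps)  # ordered (name, callable) pairs

    def run(self, basic_data):
        data = basic_data
        for _name, step in self.steps:
            data = step(data)
        return data


# Hypothetical stand-ins for feature selection, encoding, and prediction.
def select(rows):
    return [{"logins": r["logins"]} for r in rows]


def encode(rows):
    return [[float(r["logins"])] for r in rows]


def predict(vectors):
    return [v[0] > 100.0 for v in vectors]


pipeline = Pipeline([("select", select), ("encode", encode), ("predict", predict)])
results = pipeline.run([{"logins": 3, "name": "a"}, {"logins": 500, "name": "b"}])
```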
Fig. 4 is a block diagram illustrating an account identification apparatus according to an exemplary embodiment. Referring to Fig. 4, the apparatus includes an obtaining unit 401, a determining unit 402, a feature fusion unit 403, and an identifying unit 404.
an obtaining unit 401, configured to obtain a first account feature of an account to be identified;
a determining unit 402, configured to determine a first category of the account and a category feature of the account based on the first account feature, where the category feature of the account is used to represent a relationship between the account and the first category;
a feature fusion unit 403, configured to perform feature fusion on the category feature and the first account feature, so as to obtain a second account feature of the account;
an identifying unit 404, configured to input the second account feature of the account into a target classification model, and predict through the target classification model whether the account is a target-type account, so as to obtain an identification result of the account.
In one possible implementation, the obtaining unit is configured to obtain the user profile features, login features, and user behavior features of the account to be identified, delete the features whose weight is smaller than the target weight, and take the remaining features as the first account feature of the account.
In one possible implementation, the obtaining unit is configured to splice the user profile features, login features, and user behavior features of the account to obtain the first account feature of the account.
In one possible implementation, the obtaining unit is configured to encode each feature separately to obtain encoded features, and splice the encoded features to obtain the first account feature of the account.
In one possible implementation, the obtaining unit is configured to, when encoding a target feature, segment the target feature to obtain multiple sub-features of the target feature, encode the sub-features separately, and splice the encoding results to obtain the encoded target feature.
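To illustrate why segmenting before encoding shortens the code, the sketch below splits an integer identifier into base-1000 segments, one-hot encodes each segment, and splices the results. The choice of one-hot encoding, the segment base, and the function names are assumptions for illustration; the disclosure does not fix a particular segment encoding here. Under these assumptions, an identifier below 1,000,000 needs 2,000 positions instead of 1,000,000.

```python
def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec


def segmented_encode(value, segment_base=1000, num_segments=2):
    """Split an integer into base-`segment_base` segments and one-hot each.

    Assumes 0 <= value < segment_base ** num_segments.
    """
    encoded = []
    for _ in range(num_segments):
        value, segment = divmod(value, segment_base)
        # Prepend so the most significant segment comes first.
        encoded = one_hot(segment, segment_base) + encoded
    return encoded


code = segmented_encode(123456)
# len(code) is 2000: one hot position for segment 123, one for segment 456.
```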
In one possible implementation, the determining unit is configured to input the first account feature into a clustering model, and obtain, through the clustering model, the first category of the account and the category feature of the account according to the distance relationship between the first account feature and a plurality of clusters.
In one possible implementation manner, the determining unit performs parallel computation through the GPU when determining the category based on the first account feature.
In one possible implementation, the apparatus further includes: a model training unit configured to perform:
acquiring first sample account characteristics of a plurality of sample accounts; determining categories of the plurality of sample accounts and category characteristics of the plurality of sample accounts based on the plurality of first sample account characteristics, wherein the category characteristics of a sample account are used for representing the relation between the sample account and its category; respectively carrying out feature fusion on the category features of the plurality of sample accounts, the plurality of first sample account features, and the tag information of the plurality of sample accounts to obtain second sample account features of the plurality of sample accounts; training by adopting the second sample account characteristics of the plurality of sample accounts to obtain the target classification model.
In one possible implementation, the apparatus further includes:
a feature processing unit configured to perform a calculation of weights of input sample features of the plurality of sample accounts by a tree model; deleting the characteristics with the weight smaller than the target weight, and acquiring the remaining characteristics as the first sample account characteristics of the plurality of sample accounts.
In one possible implementation, the category characteristics of the account number include: a distance between the first account feature and a cluster center of the first category.
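A distance-based category feature of this kind can be sketched as follows: the account is assigned to the nearest cluster center, and the distance to that center is returned alongside the category index. Euclidean distance and the function name `category_feature` are assumptions of this sketch.

```python
import math


def category_feature(first_account_feature, centers):
    """Return (index of nearest cluster, distance to that cluster's center)."""
    distances = [math.dist(first_account_feature, c) for c in centers]
    category = min(range(len(centers)), key=distances.__getitem__)
    return category, distances[category]
```

A small distance suggests the account is typical of its category, while a large one suggests it sits near the edge of the cluster, which is the kind of account-to-category relationship the category feature is meant to capture.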
Fig. 5 is a block diagram of a server according to an exemplary embodiment. The server 500 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 501 and one or more memories 502, where at least one instruction is stored in the memory 502 and is loaded and executed by the processor 501 to implement the account identification method provided in the foregoing method embodiments. The server may of course also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described here.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. An account number identification method is characterized by comprising the following steps:
acquiring a first account characteristic of an account to be identified;
determining a first category of the account and category characteristics of the account based on the first account characteristics, wherein the category characteristics of the account are used for representing the relation between the account and the first category;
carrying out feature fusion on the category features of the account and the first account features to obtain second account features of the account;
inputting the second account characteristics of the account into a target classification model, and predicting whether the account is a target type account through the target classification model to obtain an identification result of the account, wherein the training process of the target classification model comprises the following steps: acquiring first sample account characteristics of a plurality of sample accounts; determining categories of the plurality of sample accounts and category characteristics of the plurality of sample accounts based on the plurality of first sample account characteristics, wherein the category characteristics of a sample account are used for representing the relation between the sample account and its category; respectively carrying out feature fusion on the category features of the plurality of sample accounts, the plurality of first sample account features, and the tag information of the plurality of sample accounts to obtain second sample account features of the plurality of sample accounts; training by adopting the second sample account characteristics of the plurality of sample accounts to obtain the target classification model.
2. The account identification method according to claim 1, wherein the obtaining the first account feature of the account to be identified includes:
and splicing the user data characteristics, the login characteristics and the user behavior characteristics of the account to obtain a first account characteristic of the account.
3. The method for identifying an account according to claim 2, wherein the splicing the user profile feature, the login feature, and the user behavior feature of the account to obtain the first account feature of the account includes:
and respectively encoding each characteristic to obtain encoded characteristics, and splicing the encoded characteristics to obtain first account characteristics of the account.
4. An account number identification method as claimed in claim 3, further comprising:
when the target feature is encoded, the target feature is segmented to obtain multiple segments of sub-features of the target feature, the multiple segments of sub-features are encoded respectively, and encoding results are spliced to obtain the encoded target feature.
5. The account identification method of claim 1, wherein the determining the first category of the account and the category characteristics of the account based on the first account characteristics comprises:
and inputting the first account characteristics into a clustering model, and obtaining a first category of the account and category characteristics of the account according to the distance relation between the first account characteristics and the clusters through the clustering model.
6. An account number identification method according to claim 1, characterized in that the method comprises: and when the category is determined based on the first account number characteristics, performing parallel calculation through the GPU.
7. An account number identification method as claimed in claim 1, further comprising: and calculating the weight of the input sample characteristics of the plurality of sample accounts through a tree model, deleting the characteristics with the weight smaller than the target weight, and splicing the rest characteristics into the first sample account characteristics of the plurality of sample accounts.
8. An account number recognition device, comprising:
an obtaining unit configured to obtain first account characteristics of an account to be identified;
a determining unit configured to determine a first category of the account and a category characteristic of the account based on the first account characteristic, the category characteristic of the account being used to represent a relationship between the account and the first category;
a feature fusion unit configured to perform feature fusion on the category features and the first account features to obtain second account features of the account;
a model training unit configured to perform: acquiring first sample account characteristics of a plurality of sample accounts; determining a category of the plurality of sample accounts and a category characteristic of the plurality of sample accounts based on a first target sample characteristic in the plurality of first sample account characteristics, wherein the category characteristic of the sample account is used for representing a relation between the sample account and the category; respectively carrying out feature fusion on the category features of the plurality of sample accounts, second target sample features in the plurality of first sample account features and label information of the plurality of sample accounts to obtain second sample account features of the plurality of sample accounts; training by adopting second sample account characteristics of a plurality of sample accounts to obtain a target classification model;
the identification unit is configured to input the second account characteristics of the account into a target classification model, and predict whether the account is a target type account or not through the target classification model to obtain an identification result of the account.
9. The account identification device according to claim 8, wherein the obtaining unit is configured to perform stitching of a user profile feature, a login feature, and a user behavior feature of the account, so as to obtain a first account feature of the account.
10. The account identification device according to claim 9, wherein the obtaining unit is configured to encode each feature to obtain an encoded feature, and splice the encoded features to obtain a first account feature of the account.
11. The account identification device according to claim 10, wherein the obtaining unit is configured to, when encoding a target feature, segment the target feature to obtain multiple segments of sub-features of the target feature, encode the multiple segments of sub-features respectively, and splice encoding results to obtain the encoded target feature.
12. The account identification device according to claim 8, wherein the determining unit is configured to perform inputting the first account feature into a clustering model, and obtain, by the clustering model, a first category of the account and category features of the account according to a distance relationship between the first account feature and a plurality of clusters.
13. The account identification device according to claim 8, wherein the determining unit performs parallel computation by a GPU when determining a category based on the first account feature.
14. The account number recognition device of claim 8, wherein the device further comprises:
a feature processing unit configured to perform a calculation of weights of input sample features of the plurality of sample accounts by a tree model; deleting the characteristics with the weight smaller than the target weight, and splicing the rest characteristics into the first sample account characteristics of the plurality of sample accounts.
15. A server, comprising:
a processor; a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the account identification method of any one of claims 1 to 7.
16. A storage medium, which when executed by a processor of a server, causes the server to perform the account identification method of any one of claims 1 to 7.
CN201911136455.5A 2019-11-19 2019-11-19 Account identification method, device, server and storage medium Active CN112905987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911136455.5A CN112905987B (en) 2019-11-19 2019-11-19 Account identification method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911136455.5A CN112905987B (en) 2019-11-19 2019-11-19 Account identification method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN112905987A CN112905987A (en) 2021-06-04
CN112905987B true CN112905987B (en) 2024-02-27

Family

ID=76104647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911136455.5A Active CN112905987B (en) 2019-11-19 2019-11-19 Account identification method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN112905987B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407800A (en) * 2023-09-11 2024-01-16 北京工商大学 Social media robot detection method and system based on random forest and XGBoost model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503562A (en) * 2015-09-06 2017-03-15 阿里巴巴集团控股有限公司 A kind of Risk Identification Method and device
CN108418825A (en) * 2018-03-16 2018-08-17 阿里巴巴集团控股有限公司 Risk model training, rubbish account detection method, device and equipment
CN109525595A (en) * 2018-12-25 2019-03-26 广州华多网络科技有限公司 A kind of black production account recognition methods and equipment based on time flow feature
CN110119860A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 A kind of rubbish account detection method, device and equipment
CN110198310A (en) * 2019-05-20 2019-09-03 腾讯科技(深圳)有限公司 A kind of anti-cheat method of network behavior, device and storage medium
CN110225036A (en) * 2019-06-12 2019-09-10 北京奇艺世纪科技有限公司 A kind of account detection method, device, server and storage medium
CN110232630A (en) * 2019-05-29 2019-09-13 腾讯科技(深圳)有限公司 The recognition methods of malice account, device and storage medium
CN110399925A (en) * 2019-07-26 2019-11-01 腾讯科技(武汉)有限公司 Risk Identification Method, device and the storage medium of account

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503562A (en) * 2015-09-06 2017-03-15 阿里巴巴集团控股有限公司 A kind of Risk Identification Method and device
CN110119860A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 A kind of rubbish account detection method, device and equipment
CN108418825A (en) * 2018-03-16 2018-08-17 阿里巴巴集团控股有限公司 Risk model training, rubbish account detection method, device and equipment
CN109525595A (en) * 2018-12-25 2019-03-26 广州华多网络科技有限公司 A kind of black production account recognition methods and equipment based on time flow feature
CN110198310A (en) * 2019-05-20 2019-09-03 腾讯科技(深圳)有限公司 A kind of anti-cheat method of network behavior, device and storage medium
CN110232630A (en) * 2019-05-29 2019-09-13 腾讯科技(深圳)有限公司 The recognition methods of malice account, device and storage medium
CN110225036A (en) * 2019-06-12 2019-09-10 北京奇艺世纪科技有限公司 A kind of account detection method, device, server and storage medium
CN110399925A (en) * 2019-07-26 2019-11-01 腾讯科技(武汉)有限公司 Risk Identification Method, device and the storage medium of account

Also Published As

Publication number Publication date
CN112905987A (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN112633962B (en) Service recommendation method and device, computer equipment and storage medium
CN110855648B (en) Early warning control method and device for network attack
CN113011889B (en) Account anomaly identification method, system, device, equipment and medium
CN111260220B (en) Group control equipment identification method and device, electronic equipment and storage medium
WO2023169274A1 (en) Data processing method and device, and storage medium and processor
CN114693192A (en) Wind control decision method and device, computer equipment and storage medium
CN110414335A (en) Video identification method, device and computer readable storage medium
CN113298263A (en) Calculation graph processing method and device, model running method and device, electronic equipment, server and edge terminal
CN113592593A (en) Training and application method, device, equipment and storage medium of sequence recommendation model
CN111970400A (en) Crank call identification method and device
CN115687732A (en) User analysis method and system based on AI and stream computing
CN110969261B (en) Encryption algorithm-based model construction method and related equipment
CN115860836A (en) E-commerce service pushing method and system based on user behavior big data analysis
CN114647790A (en) Big data mining method and cloud AI (Artificial Intelligence) service system applied to behavior intention analysis
CN112905987B (en) Account identification method, device, server and storage medium
CN114092162B (en) Recommendation quality determination method, and training method and device of recommendation quality determination model
CN111401675A (en) Similarity-based risk identification method, device, equipment and storage medium
CN115756821A (en) Online task processing model training and task processing method and device
CN111737319B (en) User cluster prediction method, device, computer equipment and storage medium
CN111931035B (en) Service recommendation method, device and equipment
CN113469816A (en) Digital currency identification method, system and storage medium based on multigroup technology
CN117216803B (en) Intelligent finance-oriented user information protection method and system
US20230377004A1 (en) Systems and methods for request validation
CN117932455A (en) Internet of things asset identification method and system based on neural network
CN113313587A (en) Credit risk analysis method, device, equipment and medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant