Detailed Description
The embodiments of this specification provide a method, an apparatus, and a device for detecting spam accounts.
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
In the risk control systems of most service platforms, detecting spam accounts is important for risk control security. Generally, regardless of whether an account is involved in fraud, an account with abnormal usage may be regarded as a spam account (also called a junk account), such as the accounts automatically registered in bulk by machines mentioned in the background. If a device registers multiple accounts, it is desirable to determine whether each account is a spam account according to its behavior after registration; for example, if an account is used for normal online shopping, it can be determined to be a non-spam account. In practical applications, however, sufficient evidence often becomes available only several months after the account is registered, and lawbreakers are likely to use the account for fraud or similar behavior during that period. It is therefore important to design a scheme capable of detecting spam accounts as early as possible.
A great deal of useful information can be obtained from an account media network graph, and graph computation can then improve the accuracy of account feature representation. Based on this, the embodiments of this specification propose a scheme for representing the nodes of the account media network graph as feature vectors based on unsupervised learning, and further a risk model training scheme based on supervised learning. The whole process combines unsupervised learning and supervised learning and may be called a semi-supervised learning process. The account media network graph is a heterogeneous network graph, where heterogeneous means that the nodes differ in nature; for example, some nodes in the graph may represent accounts, and others may represent account-related media such as devices and IP networks.
Fig. 1 is a schematic diagram of an overall architecture involved in a practical application scenario of the scheme of this specification. The overall architecture mainly involves an unsupervised learning server and a supervised learning server. The unsupervised learning server acquires an account media network graph reflecting registration relationships and determines the feature vectors of the nodes in the graph through unsupervised learning; the supervised learning server then trains a risk model through supervised learning, according to the feature vectors of some of the nodes and the risk labeling data, so as to detect spam accounts.
The account media network graph may be generated by the unsupervised learning server or by another device, and the risk labeling data may be generated by the supervised learning server or by another device, or labeled manually. The unsupervised learning server and the supervised learning server may be the same server.
The scheme of the present specification is described in detail below based on the exemplary architecture in fig. 1.
Fig. 2 is a flow chart of a method for detecting spam accounts according to an embodiment of this specification. The flow in Fig. 2 may include the following steps:
S202: Acquire an account media network graph generated according to account historical data, wherein nodes in the account media network graph represent accounts or media, and at least some of the edges represent registration relationships between the connected nodes.
In the embodiments of this specification, the account historical data may include data from the time of account registration, for example, the medium through which the account was registered and the registration information filled in at registration; it may also include behavior data after registration, such as the transaction information and login information of the account. For newly registered accounts, data from the time of registration may be used primarily, so that spam accounts are detected as early as possible. Some of the embodiments below are described mainly with this case as an example, in which the account media network graph may be generated, for example, based on account registrations over a past period of time.
In the embodiments of this specification, an account is registered, or performs subsequent behavior, through a medium, such as a device, an IP network, or a physical address. Some of the embodiments below take a device as the medium and describe accounts registered through devices; in that case the account media network graph is specifically an account device network graph.
When the account device network graph is generated, the accounts and devices to be represented may be determined first. Each distinct account is represented by one node and each distinct device is represented by one node, so that any node represents either an account or a device. Further, if two nodes have a registration relationship, an edge representing the registration relationship is established between them, thereby generating the account device network graph.
Here, the registration relationship mainly refers to a relationship between an account and a device, and if an account is registered by a device, the account and the device have a registration relationship. In practical applications, if there is a demand, the specific meaning of the registration relationship may be widened, for example, the registration relationship may further include a relationship between an account and another account, and if a certain account and another account are registered through the same device, the account and the other account have a registration relationship.
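As an illustration of the graph generation described above, the following is a minimal Python sketch, assuming that the registration history is available as (account, device) pairs. The function name `build_account_device_graph`, the tagged node tuples, and the sample identifiers are hypothetical and are not part of this specification.

```python
from collections import defaultdict

def build_account_device_graph(registrations):
    """Build an undirected account-device graph as an adjacency map.

    `registrations` is an iterable of (account_id, device_id) pairs
    (an assumed input format). Each distinct account and each distinct
    device becomes one node; an edge means "registered through".
    """
    graph = defaultdict(set)
    for account_id, device_id in registrations:
        account_node = ("account", account_id)  # tag keeps node types apart
        device_node = ("device", device_id)
        graph[account_node].add(device_node)
        graph[device_node].add(account_node)
    return dict(graph)

# Hypothetical registration records: two accounts share device dev_A.
registrations = [("acct_1", "dev_A"), ("acct_2", "dev_A"), ("acct_3", "dev_B")]
graph = build_account_device_graph(registrations)
```

Tagging each node with its type keeps accounts and devices distinct even if their identifiers happen to collide, which matches the heterogeneous nature of the graph.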
In the embodiments of this specification, the account media network graph may be an undirected graph or a directed graph, which is not specifically limited here. Generally, if only registration relationships are reflected, an undirected graph is adopted; if further relationships such as transaction relationships are reflected, a directed graph may be used in which the direction of an edge indicates the direction of the business relationship. For example, if node A represents a buyer account and node B represents a seller account, the edge representing the transaction relationship between them may point from node A to node B, and this direction may also reflect the direction of the flow of funds.
S204: Determine the feature vectors of the nodes in the account media network graph through unsupervised learning.
In the embodiments of this specification, for a node in the acquired account media network graph, if no corresponding feature vector has been established yet, an initialized feature vector may be established for the node according to a set rule; at this point the feature vector cannot yet accurately represent the real features of the node. Through unsupervised learning, the current feature vector of the node can be trained iteratively to obtain a feature vector that represents the real features of the node more accurately, that is, the feature vector referred to in step S204.
In the embodiments of this specification, through unsupervised learning, nodes representing accounts and nodes representing media may be mapped into the same vector space, so as to provide more uniform samples for subsequent training.
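The "set rule" for initialization is not specified here. As one possible illustration, the sketch below assumes small uniform random values and gives account nodes and medium nodes vectors of the same dimension, so that both kinds of node live in one vector space. The function name `init_feature_vectors`, the dimension, and the value range are assumptions.

```python
import random

def init_feature_vectors(nodes, dim=8, seed=0):
    """Assign every node an initial vector of length `dim`.

    Assumed rule: each component is drawn uniformly from
    [-0.5 / dim, 0.5 / dim]. Account and medium nodes are treated
    identically, so all vectors share one vector space.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    return {
        node: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
        for node in nodes
    }

# Hypothetical nodes: one account and one device.
vectors = init_feature_vectors([("account", "acct_1"), ("device", "dev_A")], dim=4)
```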
S206: Train a risk model according to the feature vectors of some of the nodes and the risk labeling data, so as to detect spam accounts.
In the embodiments of this specification, among the nodes in the account media network graph, some may represent spam accounts and some may represent devices on which spam accounts were registered, but this is not yet known. At least some of these cases need to be clarified by specific means to obtain training samples with training labels, so that the risk model can be trained by supervised learning. The specific means is not particularly limited here; it may be, for example, precise tracking analysis based on sampling, or user reports.
Based on the cases thus clarified, risk labeling data can be assigned to some of the nodes in advance or in real time. The risk labeling data indicates the risk associated with a node, for example, whether the node represents a spam account, or whether the node represents a medium on which spam accounts have been registered (more specifically, for instance, whether only spam accounts have been registered on it, or whether the number of spam accounts registered on it exceeds a set number). In practical applications, the risk is not limited to matters related to spam accounts; it may also represent, for example, the risk that a normal account is vulnerable to attack. The training label can be obtained from the risk labeling data, and generally the risk labeling data can be used directly as the training label.
The risk labeling data may be represented in various forms, which are not particularly limited here. For example, if it is determined that a node is unrelated to spam accounts, the risk labeling data of the node may be recorded as 1, whereas if it is determined that a node represents a spam account or a medium on which spam accounts have been registered, the risk labeling data of the node may be recorded as 0; and so on.
In the embodiment of the present disclosure, after the risk model is trained, the risk model may be used for classification or regression to predict the risk property of the input data.
For example, the input data may be the feature vector corresponding to an account to be detected or a medium to be detected. After processing the input, the trained risk model outputs a corresponding classification result or probability value, from which it can be determined whether the account to be detected is a spam account, or whether spam accounts have been registered on the medium to be detected. The account or medium to be detected may be represented by a node in the account media network graph or may lie outside the graph. In the former case, the corresponding feature vector has already been determined, so detection can be performed directly; in the latter case, the corresponding feature vector may not have been determined yet, and detection can be performed after the feature vector is determined using the scheme of this specification.
Of course, depending on the specific content of the risk labeling data, the risk model may also be used to predict risks other than spam accounts; the principle is the same and is not repeated here.
With the method of Fig. 2, the unsupervised learning process uses graph computation on the account media network graph to determine the feature vectors of the nodes efficiently and accurately; combined with the supervised learning process, a risk model can be trained accurately even when little risk labeling data is available, and spam accounts can then be detected effectively using the trained risk model.
Based on the method of Fig. 2, the embodiments of this specification also provide some specific implementations and extensions of the method, as described below.
In the embodiments of this specification, as can be seen from the example above, detecting spam accounts in step S206 may specifically include: inputting the feature vector corresponding to the account to be detected into the trained risk model; and determining whether the account to be detected is a spam account according to the classification result or probability value output by the trained risk model.
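The detection step can be sketched as follows, assuming a trained risk model exposed as a callable that returns the probability that a node is a spam account, and an assumed decision threshold of 0.5. The names `is_spam_account` and `toy_risk_model` are hypothetical, and the toy model is only a stand-in for a trained one. If the labels instead use 1 for nodes unrelated to spam accounts, the comparison would be inverted.

```python
def is_spam_account(feature_vector, risk_model, threshold=0.5):
    """Return True if the model's spam probability reaches the threshold.

    `risk_model` is assumed to map a feature vector to a probability
    in [0, 1] that the account is a spam account.
    """
    return risk_model(feature_vector) >= threshold

def toy_risk_model(feature_vector):
    """Stand-in for a trained model (illustration only, not a real model)."""
    return 0.9 if sum(feature_vector) < 0 else 0.1

suspicious = is_spam_account([-1.0, -0.5], toy_risk_model)
normal = is_spam_account([1.0, 0.5], toy_risk_model)
```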
In the embodiments of this specification, there are various unsupervised learning algorithms. This specification provides one unsupervised learning algorithm that can be used to determine the feature vectors; of course, other algorithms, such as clustering algorithms, may also be used to implement the unsupervised learning. The description below focuses on the algorithm provided here.
The principle of this unsupervised learning algorithm is as follows: in the account media network graph, the similarity between the feature vector of a node (called the current node) and the feature vectors of its nearby nodes should be relatively high, while the similarity between the feature vector of the current node and those of at least some nodes other than its nearby nodes should be relatively low. The feature vectors are trained with at least one of these two conditions as the objective.
Nearby nodes may be determined in various ways, according to requirements. For example, for any node, its nearby nodes may be the nodes that can be reached from it within a set number of hops, where each hop moves from the current node to one of its adjacent nodes, and the set number of hops is, for example, 5 or 3. As another example, for any node, its nearby nodes may be the other nodes within a circular region centered on the node with a set radius; and so on. The following embodiments mainly take the former as an example.
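The hop-based way of determining nearby nodes can be sketched with a breadth-first search. This is a minimal illustration, assuming the graph is stored as an adjacency map from each node to the set of its adjacent nodes; the function name `nodes_within_hops` and the sample graph are hypothetical.

```python
from collections import deque

def nodes_within_hops(graph, start, max_hops):
    """Return all nodes reachable from `start` in at most `max_hops` hops.

    `graph` maps each node to the set of its adjacent nodes. The start
    node itself is not included in the result.
    """
    seen = {start}
    nearby = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                nearby.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return nearby

# Hypothetical chain: a1 - d1 - a2 - d2 - a3
chain = {
    "a1": {"d1"},
    "d1": {"a1", "a2"},
    "a2": {"d1", "d2"},
    "d2": {"a2", "a3"},
    "a3": {"d2"},
}
two_hop = nodes_within_hops(chain, "a1", 2)
```

In an account device graph, two hops from an account reach the other accounts registered through the same device, which is one reason a small hop limit can already capture useful structure.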
In the embodiments of this specification, for step S204, determining the feature vectors of the nodes in the account media network graph through unsupervised learning may specifically include: determining the nearby nodes of a node in the account media network graph; determining the current feature vectors of the node and of its nearby nodes, where a current feature vector is initially obtained by initialization according to a set rule; and training the current feature vectors through unsupervised learning using a specified loss function, so as to determine the feature vector of the node after training; wherein the nearby nodes include nodes that can be reached from the node within a set number of hops.
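The training objective also involves at least some nodes other than the nearby nodes. How those nodes are chosen is not fixed here; the sketch below assumes uniform random sampling without replacement, which is only one possible choice. The function name `sample_non_nearby` and the sample node names are hypothetical.

```python
import random

def sample_non_nearby(all_nodes, current, nearby, k, seed=0):
    """Sample up to `k` nodes that are neither `current` nor in `nearby`.

    Assumption: uniform sampling without replacement. Candidates are
    sorted first so that a fixed seed gives a reproducible result.
    """
    rng = random.Random(seed)
    candidates = sorted(n for n in all_nodes if n != current and n not in nearby)
    return rng.sample(candidates, min(k, len(candidates)))

# Hypothetical node set; d1 and a2 are taken to be the nearby nodes of a1.
all_nodes = ["a1", "d1", "a2", "d2", "a3"]
others = sample_non_nearby(all_nodes, "a1", {"d1", "a2"}, k=2)
```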
Further, training the current feature vectors may specifically include: training the current feature vectors with the objective of increasing the similarity between the current feature vector of the node and those of its nearby nodes. Similarity between vectors can be measured in various ways, for example, by the dot product of the vectors or by the spatial distance between them.
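The two similarity measures mentioned above can be written in a few lines. This sketch uses plain Python lists as vectors; the function names are hypothetical. A larger dot product indicates higher similarity, whereas a smaller distance indicates higher similarity.

```python
import math

def dot_similarity(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

similarity = dot_similarity([1.0, 2.0], [3.0, 4.0])
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```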
The embodiments of this specification also provide a specified loss function for training the feature vectors. When the similarity between vectors is measured by the vector dot product, the loss function may take a form such as:

$$L(w) = \sum_{c \in T(w)} \left[ -\log \sigma\left(e_c^{\top} e_w\right) - \lambda \, E_{c' \in U(w)} \log \sigma\left(-e_{c'}^{\top} e_w\right) \right]$$

where $e_w$ represents the feature vector of the current node $w$, $T(w)$ represents the node set formed by the nearby nodes of $w$, $U(w)$ represents a node set formed by at least some of the nodes other than the nearby nodes of $w$, $e_c$ represents the feature vector of a node in $T(w)$, $e_{c'}$ represents the feature vector of a node in $U(w)$, $\sigma$ represents an activation function, $\lambda$ represents a hyperparameter, and $E_{c' \in U(w)}$ represents the expectation taken when $c'$ follows a specified probability distribution.
Of course, the loss function above is only an example, and it can be adjusted as long as it remains suited to the training objective described above.
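To make the training objective concrete, the sketch below assumes a negative-sampling form of the loss for one (current node, nearby node) pair: -log σ(e_c · e_w) minus λ times the average of log σ(-e_c' · e_w) over sampled non-nearby nodes c'. It takes σ to be the sigmoid function and approximates the expectation by a plain average; both are assumptions, as are the names `pair_loss` and `sgd_step`, the learning rate, and the toy vectors.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pair_loss(vectors, w, c, negatives, lam=1.0):
    """Assumed loss for current node w, nearby node c, non-nearby nodes."""
    e_w = vectors[w]
    positive = -math.log(sigmoid(dot(vectors[c], e_w)))
    negative = -sum(
        math.log(sigmoid(-dot(vectors[n], e_w))) for n in negatives
    ) / len(negatives)
    return positive + lam * negative

def sgd_step(vectors, w, c, negatives, lam=1.0, lr=0.1):
    """One gradient-descent step on pair_loss, updating vectors in place."""
    e_w = vectors[w]
    dim = len(e_w)
    grads = {}

    def accumulate(node, coeff, vec):
        g = grads.setdefault(node, [0.0] * dim)
        for i in range(dim):
            g[i] += coeff * vec[i]

    # d/ds of -log(sigmoid(s)) is sigmoid(s) - 1, with s = e_c . e_w
    g_pos = sigmoid(dot(vectors[c], e_w)) - 1.0
    accumulate(w, g_pos, vectors[c])
    accumulate(c, g_pos, e_w)
    for n in negatives:
        # d/ds of -log(sigmoid(-s)) is sigmoid(s), with s = e_n . e_w
        g_neg = lam * sigmoid(dot(vectors[n], e_w)) / len(negatives)
        accumulate(w, g_neg, vectors[n])
        accumulate(n, g_neg, e_w)
    # Apply all updates only after every gradient has been computed.
    for node, g in grads.items():
        vectors[node] = [x - lr * gi for x, gi in zip(vectors[node], g)]

# Toy vectors: "c" is a nearby node of "w", "n" is a non-nearby node.
vectors = {"w": [0.1, 0.2], "c": [0.2, -0.1], "n": [0.1, 0.1]}
loss_before = pair_loss(vectors, "w", "c", ["n"])
for _ in range(20):
    sgd_step(vectors, "w", "c", ["n"])
loss_after = pair_loss(vectors, "w", "c", ["n"])
```

Each step pulls the current node's vector toward its nearby node and pushes it away from the sampled non-nearby node, which is the behavior the training objective calls for.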
In the embodiments of this specification, for step S206, training the risk model according to the feature vectors of some of the nodes and the risk labeling data may specifically include: taking the feature vectors of some of the nodes as model input data and the corresponding risk labeling data as supervised learning training labels, to train the risk model.
When the scheme of this specification is applied to an online scenario, the requirements on the speed of model training and of subsequent detection are relatively high, and a suitable risk model can be selected accordingly; for example, a relatively fast Logistic regression model may be adopted as the risk model, and the description below takes this as an example.
In the embodiments of this specification, the Logistic regression model may be implemented using a sigmoid function. For example, the risk labeling data of a node may be set to 1 or 0 as described above, which forms a classification problem, and the prediction function of the Logistic regression model may be defined from the sigmoid function as follows:

$$h_\theta(x_i) = \frac{1}{1 + e^{-\theta^{\top} x_i}}$$

where $\theta$ represents the parameters to be trained of the Logistic regression model, and $x_i$ represents the feature vector of the node numbered $i$.
Further, the loss function of the Logistic regression model may be constructed, for example, as the cross-entropy loss:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$$

where $y_i$ represents the risk labeling data of the node numbered $i$, and $m$ represents the number of labeled nodes used for training.
The Logistic regression model is then trained using the feature vectors of some of the nodes and the risk labeling data.
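A minimal sketch of such training is shown below, assuming a sigmoid prediction function, a mean cross-entropy loss, and batch gradient descent. The learning rate, the epoch count, the constant bias component appended to each vector, and the toy data are all assumptions. The labels follow the convention given earlier (1 for nodes unrelated to spam accounts, 0 for spam-related nodes), so a low predicted value indicates a spam-related node.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """Prediction function: sigmoid of the dot product theta . x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit theta by batch gradient descent on the mean cross-entropy loss."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for x_i, y_i in zip(X, y):
            error = predict_proba(theta, x_i) - y_i
            for j, x_ij in enumerate(x_i):
                grad[j] += error * x_ij / m
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Toy feature vectors with a constant 1.0 appended as a bias term.
X = [
    [1.0, 0.9, 1.0],
    [0.8, 1.0, 1.0],
    [-0.9, -1.0, 1.0],
    [-1.0, -0.8, 1.0],
]
y = [1, 1, 0, 0]  # 1: unrelated to spam accounts, 0: spam-related
theta = train_logistic_regression(X, y)
```

Because both training and prediction reduce to dot products and a sigmoid, this kind of model is cheap to fit and to evaluate, which suits the speed requirements of an online scenario.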
The foregoing describes the spam account detection method provided in the embodiments of this specification. Based on the same idea, the embodiments of this specification further provide a corresponding apparatus and device, as shown in Fig. 3 and Fig. 4.
Fig. 3 is a schematic structural diagram of a spam account detection apparatus corresponding to Fig. 2 according to an embodiment of this specification. The apparatus may be located in the execution body of the flow in Fig. 2 and includes:
an acquiring module 301, which acquires an account media network graph generated according to account historical data, wherein nodes in the account media network graph represent accounts or media, and at least some of the edges represent registration relationships between the connected nodes;
a determining module 302, which determines the feature vectors of the nodes in the account media network graph through unsupervised learning; and
a training detection module 303, which trains a risk model according to the feature vectors of some of the nodes and the risk labeling data, so as to detect spam accounts.
Optionally, the medium comprises a device.
Optionally, the training detection module 303 detecting spam accounts specifically includes:
the training detection module 303 inputs the feature vector corresponding to the account to be detected into the trained risk model; and
determines whether the account to be detected is a spam account according to the classification result or probability value output by the trained risk model.
Optionally, the determining module 302 determining the feature vectors of the nodes in the account media network graph through unsupervised learning specifically includes:
the determining module 302 determines the nearby nodes of a node in the account media network graph;
determines the current feature vectors of the node and of its nearby nodes, where a current feature vector is initially obtained by initialization according to a set rule; and
trains the current feature vectors through unsupervised learning using a specified loss function, so as to determine the feature vector of the node after training;
wherein the nearby nodes include nodes that can be reached from the node within a set number of hops.
Optionally, the determining module 302 training the current feature vectors specifically includes:
the determining module 302 trains the current feature vectors with the objective of increasing the similarity between the current feature vector of the node and those of its nearby nodes.
Optionally, if the similarity between vectors is measured based on the vector dot product, the specified loss function may take a form such as:

$$L(w) = \sum_{c \in T(w)} \left[ -\log \sigma\left(e_c^{\top} e_w\right) - \lambda \, E_{c' \in U(w)} \log \sigma\left(-e_{c'}^{\top} e_w\right) \right]$$

where $e_w$ represents the feature vector of the current node $w$, $T(w)$ represents the node set formed by the nearby nodes of $w$, $U(w)$ represents a node set formed by at least some of the nodes other than the nearby nodes of $w$, $e_c$ represents the feature vector of a node in $T(w)$, $e_{c'}$ represents the feature vector of a node in $U(w)$, $\sigma$ represents an activation function, $\lambda$ represents a hyperparameter, and $E_{c' \in U(w)}$ represents the expectation taken when $c'$ follows a specified probability distribution.
Optionally, the training detection module 303 training a risk model according to the feature vectors of some of the nodes and the risk labeling data specifically includes:
the training detection module 303 takes the feature vectors of some of the nodes as model input data and the corresponding risk labeling data as supervised learning training labels, to train a Logistic regression model.
Fig. 4 is a schematic structural diagram of a spam account detection device corresponding to Fig. 2 according to an embodiment of this specification, where the device includes:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquire an account media network graph generated according to account historical data, wherein nodes in the account media network graph represent accounts or media, and at least some of the edges represent registration relationships between the connected nodes;
determine the feature vectors of the nodes in the account media network graph through unsupervised learning; and
train a risk model according to the feature vectors of some of the nodes and the risk labeling data, so as to detect spam accounts.
Based on the same idea, the embodiments of the present specification further provide a corresponding non-volatile computer storage medium, storing computer executable instructions, where the computer executable instructions are configured to:
acquire an account media network graph generated according to account historical data, wherein nodes in the account media network graph represent accounts or media, and at least some of the edges represent registration relationships between the connected nodes;
determine the feature vectors of the nodes in the account media network graph through unsupervised learning; and
train a risk model according to the feature vectors of some of the nodes and the risk labeling data, so as to detect spam accounts.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and non-volatile computer storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The apparatus, the device, the nonvolatile computer storage medium and the method provided in the embodiments of the present disclosure correspond to each other, and therefore, the apparatus, the device, the nonvolatile computer storage medium also have similar advantageous technical effects as those of the corresponding method, and since the advantageous technical effects of the method have been described in detail above, the advantageous technical effects of the corresponding apparatus, device, and nonvolatile computer storage medium are not described herein again.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely performing a little logic programming on the method flow in one of the hardware description languages described above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Or even, the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described with its functions divided into various units. Of course, when this specification is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that the present description may be provided as a method, system, or computer program product. Accordingly, the present specification embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description embodiments may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
In this specification, the embodiments are described in a progressive manner. Identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing are merely exemplary embodiments of the present specification and are not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.