CN110324352B - Method and device for identifying batch registered account groups - Google Patents
Method and device for identifying batch registered account groups
- Publication number
- CN110324352B (application CN201910622891.7A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- accounts
- account
- registration
- network protocol
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention relates to the technical field of big data risk control, and in particular to a method and a device for identifying batch registered account groups. The method comprises: acquiring, for each of a plurality of accounts, a network protocol (IP) address, a registration time, a registration source, operation behaviors, and the operation time corresponding to each operation behavior; determining the similarity between every two accounts in the plurality of accounts; performing density-based clustering on the plurality of accounts according to the similarity between every two accounts; and determining, among the account groups obtained by clustering, those whose number of accounts is greater than a batch registration threshold as batch registered account groups. By analyzing the synchronicity among accounts in this way, the method and device can still accurately identify which accounts were registered in batches even if black-industry operators tamper with device identifiers, which improves the identification accuracy for batch registered accounts.
Description
Technical Field
The invention relates to the technical field of big data risk control, and in particular to a method and a device for identifying batch registered account groups.
Background
The network black industry refers to illegal activity that uses the internet as its medium and network technology as its main means, and that threatens the security of computer information systems and the orderly management of cyberspace. On a live-streaming platform, black-industry operators use computer programs to register accounts in bulk within a short time to serve their purposes; this is called batch registration. The batch registered account groups obtained in this way are generally used for illegal activities. Therefore, to ensure internet security, batch registered account groups need to be identified accurately.
Because a batch registered account group usually shares limited device resources, several accounts inevitably end up sharing one device. The prior art therefore decides whether a plurality of accounts belong to a batch registered account group by checking whether they share a device. However, with this identification method, if black-industry operators tamper with the device identifiers, the device reuse rate among the accounts drops and the batch registered account groups become hard to find, so the identification accuracy for batch registered accounts is low.
Disclosure of Invention
In view of the above problems, the present invention is proposed to provide a method and apparatus for identifying a batch registered account group that overcomes or at least partially solves the above problems.
According to a first aspect of the present invention, there is provided a method for identifying a batch registered account group, the method comprising:
acquiring a network protocol address, registration time, a registration source, an operation behavior and operation time corresponding to the operation behavior of each account in a plurality of accounts;
determining similarity between every two account numbers in the plurality of account numbers;
performing density-based clustering on the plurality of account numbers according to the similarity between every two account numbers;
determining account groups of which the number of accounts is larger than a batch registration threshold value in the account groups obtained after clustering as batch registration account groups;
determining the similarity between the two account numbers comprises the following steps:
performing network protocol address similarity calculation based on the network protocol addresses of the two accounts to obtain a first similarity;
calculating the similarity of the registration time based on the registration time of the two accounts to obtain a second similarity;
performing registration source similarity calculation based on the registration sources of the two account numbers to obtain a third similarity;
calculating the similarity of the operation behaviors based on the operation behaviors of the two accounts and the operation time corresponding to the operation behaviors to obtain a fourth similarity;
and determining the similarity between the two account numbers based on the first similarity, the second similarity, the third similarity, the fourth similarity and the weight coefficients corresponding to the similarities.
Preferably, the calculating of the network protocol address similarity based on the network protocol addresses of the two accounts to obtain the first similarity uses the following formula:
$$\text{IP-sim}(u,v) = 1 - \frac{\left|\text{ip2Long}(IP_u) - \text{ip2Long}(IP_v)\right|}{2^{32}}$$
wherein IP-sim(u, v) is the first similarity, $IP_u$ is the network protocol address of the first of the two accounts, and $IP_v$ is the network protocol address of the second of the two accounts.
Preferably, the calculating of the registration time similarity based on the registration times of the two accounts to obtain the second similarity uses the following formula:
$$\text{time-sim}(u,v) = e^{-0.01\,\left|t_u - t_v\right|}$$
wherein time-sim(u, v) is the second similarity, $t_u$ is the registration time of the first of the two accounts, and $t_v$ is the registration time of the second of the two accounts.
Preferably, the calculating of the registration source similarity based on the registration sources of the two accounts to obtain the third similarity uses the following formula:
$$\text{src-sim}(u,v) = I(src_u = src_v)$$
wherein src-sim(u, v) is the third similarity, $src_u$ is the registration source of the first of the two accounts, $src_v$ is the registration source of the second of the two accounts, and I is the indicator function: $I(src_u = src_v)$ equals 1 if $src_u = src_v$ and 0 otherwise.
Preferably, the calculating of the operation behavior similarity based on the operation behaviors of the two accounts and the operation times corresponding to the operation behaviors to obtain the fourth similarity uses the following formula:
$$\text{behavior-sim}(u,v) = \sum_{i=1}^{s} 2^{-i}\, e^{-0.01\,\Delta t_i}$$
wherein behavior-sim(u, v) is the fourth similarity, $\Delta t_i$ is the i-th element of the operation time difference sequence, the operation time difference sequence consists of the minimum operation time differences of the two accounts under each identical operation behavior, and s is the total length of the operation time difference sequence.
According to a second aspect of the present invention, there is provided an apparatus for identifying a batch registered account group, the apparatus comprising:
the acquisition module is used for acquiring a network protocol address, registration time, a registration source, an operation behavior and operation time corresponding to the operation behavior of each account in a plurality of accounts;
the first determination module is used for determining the similarity between every two account numbers in the plurality of account numbers;
the clustering module is used for carrying out density-based clustering on the plurality of account numbers according to the similarity between every two account numbers;
the second determining module is used for determining account groups with the account number larger than the batch registration threshold value in the clustered account groups as batch registration account groups;
wherein the first determining module comprises:
a first obtaining unit, configured to perform network protocol address similarity calculation based on network protocol addresses of the two accounts to obtain a first similarity;
the second obtaining unit is used for calculating the similarity of the registration time based on the registration time of the two accounts to obtain a second similarity;
a third obtaining unit, configured to perform registration source similarity calculation based on the registration sources of the two account numbers to obtain a third similarity;
a fourth obtaining unit, configured to perform operation behavior similarity calculation based on the operation behaviors of the two accounts and the operation time corresponding to the operation behaviors, so as to obtain a fourth similarity;
and the determining unit is used for determining the similarity between the two account numbers based on the first similarity, the second similarity, the third similarity, the fourth similarity and the weight coefficients corresponding to the similarities.
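Preferably, the first obtaining unit obtains the first similarity using the following formula:
$$\text{IP-sim}(u,v) = 1 - \frac{\left|\text{ip2Long}(IP_u) - \text{ip2Long}(IP_v)\right|}{2^{32}}$$
wherein IP-sim(u, v) is the first similarity, $IP_u$ is the network protocol address of the first of the two accounts, and $IP_v$ is the network protocol address of the second of the two accounts.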
Preferably, the second obtaining unit obtains the second similarity using the following formula:
$$\text{time-sim}(u,v) = e^{-0.01\,\left|t_u - t_v\right|}$$
wherein time-sim(u, v) is the second similarity, $t_u$ is the registration time of the first of the two accounts, and $t_v$ is the registration time of the second of the two accounts.
According to a third aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the method steps as in the first aspect described above.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method steps as in the first aspect when executing the program.
According to the method and the device for identifying batch registered account groups, the network protocol address, registration time, registration source, operation behaviors, and the operation time corresponding to each operation behavior are first acquired for each of a plurality of accounts. Next, the similarity between every two of the accounts is determined. Density-based clustering is then performed on the accounts according to the similarity between every two accounts. Finally, among the account groups obtained by clustering, those whose number of accounts is greater than the batch registration threshold are determined to be batch registered account groups. Determining the similarity between two accounts comprises: calculating a network protocol address similarity from the network protocol addresses of the two accounts to obtain a first similarity; calculating a registration time similarity from their registration times to obtain a second similarity; calculating a registration source similarity from their registration sources to obtain a third similarity; calculating an operation behavior similarity from their operation behaviors and the corresponding operation times to obtain a fourth similarity; and determining the similarity between the two accounts from the first, second, third and fourth similarities and the weight coefficient corresponding to each. Because this process analyzes the synchronicity among accounts, it can still accurately identify which accounts were registered in batches even if black-industry operators tamper with device identifiers, which improves the identification accuracy for batch registered accounts.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flowchart of a method for identifying batch registered account groups in a first embodiment of the present invention;
FIG. 2 shows a flowchart of step 102 in the first embodiment of the present invention;
FIG. 3 shows a schematic structural diagram of an apparatus for identifying batch registered account groups in a second embodiment of the present invention;
FIG. 4 shows a block diagram of a computer device in a fourth embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
A first embodiment of the present invention provides a method for identifying batch registered account groups, where the method is intended to: for a plurality of accounts, identifying which accounts belong to a batch registration account group obtained through batch registration. As shown in fig. 1, the method includes:
Step 101: acquire the network protocol address, registration time, registration source, operation behaviors, and the operation time corresponding to each operation behavior of each account in a plurality of accounts.
Step 102: determine the similarity between every two accounts in the plurality of accounts.
Step 103: perform density-based clustering on the plurality of accounts according to the similarity between every two accounts.
Step 104: among the account groups obtained by clustering, determine those whose number of accounts is greater than a batch registration threshold as batch registered account groups.
For step 101, the network protocol address, registration time, registration source, operation behaviors, and the operation time corresponding to each operation behavior are first acquired for each of the accounts to be identified. The network protocol address is the IP address of the account. The registration time is the time at which the account was registered. The registration source is the channel through which the account was registered, for example a website, an application, WeChat, or QQ. The operation behaviors are the behaviors generated after the account was registered, such as login, check-in, and web page browsing. For each operation behavior, the back end or server records the operation time at which it was performed; for example, the operation time of a login behavior is 09:00 on day X of month X of year X.
In this embodiment, suppose the plurality of accounts are a first account, a second account, and a third account. Executing step 101 then acquires the network protocol address, registration time, registration source, operation behaviors, and corresponding operation times of the first account, of the second account, and of the third account, respectively.
For step 102, for multiple accounts, the similarity between every two accounts is calculated. For example, for a first account, a second account, and a third account, the similarity between the first account and the second account, the similarity between the second account and the third account, and the similarity between the first account and the third account are calculated, respectively.
Further, as shown in FIG. 2, the similarity between two accounts is calculated through the following steps:
Step 201: calculate the network protocol address similarity from the network protocol addresses of the two accounts to obtain a first similarity.
Step 202: calculate the registration time similarity from the registration times of the two accounts to obtain a second similarity.
Step 203: calculate the registration source similarity from the registration sources of the two accounts to obtain a third similarity.
Step 204: calculate the operation behavior similarity from the operation behaviors of the two accounts and the operation times corresponding to those behaviors to obtain a fourth similarity.
Step 205: determine the similarity between the two accounts from the first, second, third and fourth similarities and the weight coefficient corresponding to each.
To more clearly explain the above process, two account numbers are defined as a first account number u and a second account number v, respectively. How to calculate the similarity between the first account and the second account will be described in detail below.
For step 201, the network protocol (IP) address of an account is typically a string of four dot-separated numeric segments, such as 110.18.82.28. Performing step 201 converts the difference between the network protocol addresses of the two accounts into a numerical value that characterizes the similarity between those addresses. Specifically, the first similarity between the first account and the second account is obtained with the following formula one:
$$\text{IP-sim}(u,v) = 1 - \frac{\left|\text{ip2Long}(IP_u) - \text{ip2Long}(IP_v)\right|}{2^{32}}$$ (formula one)
where IP-sim(u, v) is the first similarity, $IP_u$ is the network protocol address of the first account, $IP_v$ is the network protocol address of the second account, ip2Long() is a conversion function that converts the character-type address into a numeric (long-integer) value, and $2^{32}$ is the upper bound of the numeric IP value range.
For formula one, a long-integer IP value ranges from 0 to $2^{32}-1$, so the denominator of formula one, $2^{32}$, bounds the largest possible difference between two long-integer IP values. The numerator, the absolute difference between the two long-integer IP values, is the interval between them. Dividing the interval by this maximum normalizes it to [0, 1]. The result is the normalized distance between the two long-integer IP values; to turn it into a similarity it is subtracted from 1, so that the similarity also lies in [0, 1], which yields formula one.
Compared with directly comparing the character-type network protocol addresses of two accounts, formula one determines the similarity between the addresses more accurately. For example, take the IPs of two accounts, 39.180.27.112 and 39.180.27.115. A direct comparison only says that the two IPs differ, yet their first three segments are identical and they may well be on the same network, so their similarity should be high. Formula one gives a similarity of more than 0.999 for these two IPs and therefore describes the correlation between them accurately. The greater the first similarity, the closer the IPs of the two accounts and the more similar the two accounts. Conversely, the smaller the first similarity, the larger the difference between the IPs and the less similar the two accounts.
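As an illustration of formula one, the following minimal Python sketch computes the first similarity for the two example IPs. It assumes IPv4 addresses only; the function names `ip2long` and `ip_sim` are hypothetical, and the standard-library `ipaddress` module stands in for the ip2Long() conversion named in the text.

```python
import ipaddress

# Denominator of formula one: the size of the 32-bit IPv4 value range.
MAX_IP_SPAN = 2 ** 32


def ip2long(ip: str) -> int:
    """Convert a dotted IPv4 string to its integer value (hypothetical helper)."""
    return int(ipaddress.IPv4Address(ip))


def ip_sim(ip_u: str, ip_v: str) -> float:
    """First similarity: 1 minus the normalized distance between two IPs."""
    return 1.0 - abs(ip2long(ip_u) - ip2long(ip_v)) / MAX_IP_SPAN


if __name__ == "__main__":
    # The two example addresses from the text differ by 3 in integer form.
    print(ip_sim("39.180.27.112", "39.180.27.115"))
```

Because the distance is divided by the full 32-bit range, even addresses in different /24 networks still score close to 1, so a deployment might want to check how much discriminating power this term carries before relying on it.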
For step 202, performing step 202 converts the difference between the registration times of the two accounts into a numerical value that characterizes the similarity between those registration times. Specifically, the second similarity between the first account and the second account is obtained with the following formula two:
$$\text{time-sim}(u,v) = e^{-0.01\,\left|t_u - t_v\right|}$$ (formula two)
where time-sim(u, v) is the second similarity, $t_u$ is the registration time of the first account, and $t_v$ is the registration time of the second account.
For formula two, the relationship between the interval separating two registration times and the similarity is nonlinear, so formula two measures the similarity with an exponential function. The smaller the interval, the more likely the two accounts were registered simultaneously by a batch script, and the calculated similarity is correspondingly high. As the interval grows, the likelihood that the two accounts are associated falls off rapidly. The 0.01 in formula two is a weight coefficient fitted from data in the current service: the similarity between observed accounts registered by batch scripts is set to 1, the similarity between normally registered accounts is set to 0, and the coefficient is fitted by linear regression.
The difference between the two account registration times can be accurately determined through the second formula, and compared with the prior art, the description precision of the similarity is improved. The greater the second similarity, the closer the registration time of the two accounts and thus the more similar the two accounts are. Conversely, the smaller the second similarity is, the longer the difference between the registration times of the two accounts is, and the more dissimilar the two accounts are.
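The registration time similarity can be sketched as follows. This is a minimal illustration that assumes formula two is an exponential decay of the absolute registration-time gap with the coefficient 0.01 mentioned in the text; the time unit is not stated in the source, so the numeric timestamps and the name `time_sim` are assumptions.

```python
import math

# Coefficient quoted in the text; the source fits it from service data.
DECAY = 0.01


def time_sim(t_u: float, t_v: float, decay: float = DECAY) -> float:
    """Second similarity: exponential decay of the registration-time gap.

    t_u and t_v are numeric timestamps in the same (assumed) unit.
    """
    return math.exp(-decay * abs(t_u - t_v))


if __name__ == "__main__":
    # Accounts registered close together score near 1; distant ones near 0.
    for gap in (0, 10, 100, 1000):
        print(gap, round(time_sim(0, gap), 4))
```

The exponential form keeps the result in (0, 1] and makes the score fall quickly as the gap grows, which matches the nonlinear behavior described for this similarity.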
For step 203, the difference between the registered sources of the two accounts can be converted to a value by performing step 203. This value can characterize the similarity between the registered sources of the two accounts. Specifically, the following formula three is adopted to obtain a third similarity between the first account and the second account:
$$\text{src-sim}(u,v) = I(src_u = src_v)$$ (formula three)
where src-sim(u, v) is the third similarity, $src_u$ is the registration source of the first account, $src_v$ is the registration source of the second account, and I is the indicator function: $I(src_u = src_v)$ equals 1 if $src_u = src_v$ and 0 otherwise.
For formula three, because the registration source takes values from a finite discrete set, an indicator function is used to measure the similarity between the registration sources of two accounts.
The difference between the two account registration sources can be accurately determined through the formula III. The greater the third similarity, the closer the registered sources of the two accounts are, and the more similar the two accounts are. Conversely, the smaller the third similarity is, the larger the difference between the registered sources of the two accounts is, and the more dissimilar the two accounts are.
For step 204, performing step 204 converts the difference between the operation behaviors of the two accounts into a numerical value that characterizes the similarity between those operation behaviors. Specifically, the fourth similarity between the first account and the second account is obtained with the following formula four:
$$\text{behavior-sim}(u,v) = \sum_{i=1}^{s} 2^{-i}\, e^{-0.01\,\Delta t_i}$$ (formula four)
where behavior-sim(u, v) is the fourth similarity, $\Delta t_i$ is the i-th element of the operation time difference sequence, the operation time difference sequence consists of the minimum operation time differences of the first account and the second account under each identical operation behavior, and s is the total length of the operation time difference sequence.
For formula four, the term $2^{-i}$ is a weight chosen according to the characteristics of black-industry software: the shorter the operation time difference for an operation behavior, the higher the weight $2^{-i}$ it receives. Because black-industry software operates in batches, two accounts driven by the same software perform one or more behaviors within a very short time of each other, and the shorter that time, the stronger the similarity relationship and the higher the weight it should be given. The weights also form a convergent series in the mathematical sense, i.e. $\sum_{i=1}^{\infty} 2^{-i} = 1$. Although other series could be used, this one sums to 1 when the number of elements in the operation time difference sequence is infinite, which ensures that the final similarity still lies in [0, 1] and can conveniently be combined with the other similarities into an overall measure of similarity between accounts. In addition, the more operation behaviors on which the two accounts have short operation time intervals, the more likely the two accounts are closely associated, and the accumulation in the series highlights this. The 0.01 in the formula is a weight coefficient fitted from data in the current service, using the same fitting method as for formula two.
Specifically, for a given identical operation behavior, let the sequence of operation times at which the first account performed that behavior be $t_u = (t_{u,1}, t_{u,2}, \ldots)$ and the corresponding sequence for the second account be $t_v = (t_{v,1}, t_{v,2}, \ldots)$. $t_u$ and $t_v$ are compared to determine the minimum operation time difference; that is, the two closest times, one from $t_u$ and one from $t_v$, are found and their difference is taken: $\Delta t = \min_{j,k} \left|t_{u,j} - t_{v,k}\right|$.
For example, suppose the first account has a login behavior and a check-in behavior, and the second account also has a login behavior and a check-in behavior. The operation times of the login behavior of the first account are 9:00 and 10:00, and those of the second account are 9:03 and 10:15, so the minimum operation time difference of the two accounts for the login behavior is 3 minutes, i.e. $\Delta t_1 = 3$. The operation times of the check-in behavior of the first account are 9:30 and 10:20, and those of the second account are 9:32 and 10:50, so the minimum operation time difference of the two accounts for the check-in behavior is 2 minutes, i.e. $\Delta t_2 = 2$. In this example s = 2, because the first account and the second account share two identical operation behaviors.
Note that the operation time difference sequence can also be understood as the sequence formed by $\Delta t_i$ for i = 1 to i = s.
The difference between the operation behaviors of two accounts can be accurately determined with formula four, and the result combines conveniently with the similarities for network protocol address, registration time and registration source to give an overall measure of similarity between accounts. The greater the fourth similarity, the closer the operation behaviors of the two accounts and the more similar the two accounts. Conversely, the smaller the fourth similarity, the larger the difference between the operation behaviors and the less similar the two accounts. Formula four finds the operations of the two accounts that occur closest together in time and weights them according to the interval between them: the shorter the interval, the higher the weight.
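The operation behavior similarity can be sketched in Python as below. This is only an illustration under stated assumptions: it assumes the fourth similarity sums the weights 2^-i times an exponential decay of each minimum time gap with coefficient 0.01, and it assumes the gaps are sorted in ascending order so that the shortest gap receives the largest weight. The function names, the dictionary layout, and the use of minutes since midnight are hypothetical.

```python
import math
from typing import Dict, List

# Coefficient quoted in the text; assumed here to multiply each time gap.
DECAY = 0.01


def min_time_gap(times_u: List[float], times_v: List[float]) -> float:
    """Smallest absolute difference between any time in times_u and any in times_v."""
    return min(abs(a - b) for a in times_u for b in times_v)


def time_gap_sequence(ops_u: Dict[str, List[float]],
                      ops_v: Dict[str, List[float]]) -> List[float]:
    """Minimum gap per shared behavior, sorted ascending (sorting is an assumption)."""
    shared = set(ops_u) & set(ops_v)
    gaps = [min_time_gap(ops_u[b], ops_v[b])
            for b in shared if ops_u[b] and ops_v[b]]
    return sorted(gaps)


def behavior_sim(ops_u: Dict[str, List[float]],
                 ops_v: Dict[str, List[float]],
                 decay: float = DECAY) -> float:
    """Fourth similarity: weighted sum over the gap sequence, weights 2**-i."""
    gaps = time_gap_sequence(ops_u, ops_v)
    return sum(2.0 ** -i * math.exp(-decay * gap)
               for i, gap in enumerate(gaps, start=1))


if __name__ == "__main__":
    # Times from the worked example, expressed as minutes since midnight.
    first = {"login": [540, 600], "check_in": [570, 620]}
    second = {"login": [543, 615], "check_in": [572, 650]}
    print(time_gap_sequence(first, second))
    print(behavior_sim(first, second))
```

Since the weights 2^-i sum to less than 1 for any finite number of shared behaviors, the result stays within [0, 1), which keeps it comparable with the other three similarities.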
For step 205, the final similarity between the two accounts (which may be called the fifth similarity) is obtained from the first, second, third and fourth similarities using the following formula five:
$$\text{sim}(u,v) = w_1\,\text{IP-sim}(u,v) + w_2\,\text{src-sim}(u,v) + w_3\,\text{time-sim}(u,v) + w_4\,\text{behavior-sim}(u,v)$$ (formula five)
where $w_i$ ($i = 1, 2, 3, 4$) are weight coefficients, each taking a value between 0 and 1 and satisfying $\sum_{i=1}^{4} w_i = 1$, and sim(u, v) is the fifth similarity.
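The weighted combination in formula five can be sketched as follows. The weights are keyed by name rather than by index to keep each weight visibly paired with its similarity; the function name `combined_sim`, the dictionary keys, and the sample values are hypothetical.

```python
import math
from typing import Dict


def combined_sim(parts: Dict[str, float], weights: Dict[str, float]) -> float:
    """Fifth similarity: weighted sum of the four component similarities."""
    if parts.keys() != weights.keys():
        raise ValueError("parts and weights must cover the same similarity names")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * parts[name] for name in parts)


if __name__ == "__main__":
    # Illustrative component similarities and weights (not taken from real data).
    parts = {"ip": 0.999, "time": 0.5, "src": 1.0, "behavior": 0.7}
    weights = {"ip": 0.3, "time": 0.1, "src": 0.3, "behavior": 0.3}
    print(combined_sim(parts, weights))
```

Requiring the weights to sum to 1 keeps the combined value in [0, 1] whenever each component lies in [0, 1].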
It should be noted that each weight coefficient is derived from statistics on batch registered account groups already found in the service. The four kinds of similarity are computed between every two accounts in those groups, and the mean of each kind is taken as the similarity of the batch registered account groups in that aspect; the higher the similarity in an aspect, the higher the weight set for that aspect. For example, after the batch registered account groups found within one week in a given service are obtained, the first, second, third and fourth similarities between every two accounts in those groups are determined with the formulas above. The mean of all first similarities is then calculated, for example 0.75; the mean of all second similarities, for example 0.25; the mean of all third similarities, for example 0.75; and the mean of all fourth similarities, for example 0.75. The weight for each similarity in formula five is then set in proportion to 0.75, 0.25, 0.75 and 0.75, subject to the weights summing to 1. The weights for the first, second, third and fourth similarities are thus 0.3, 0.1, 0.3 and 0.3, respectively.
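The proportional weight derivation in this example can be checked with a few lines of Python. This is a minimal sketch: the function name `weights_from_means` and the dictionary keys are hypothetical, and the means are the example values from the text.

```python
from typing import Dict


def weights_from_means(means: Dict[str, float]) -> Dict[str, float]:
    """Scale the mean similarities so that the resulting weights sum to 1."""
    total = sum(means.values())
    if total <= 0:
        raise ValueError("at least one mean similarity must be positive")
    return {name: value / total for name, value in means.items()}


if __name__ == "__main__":
    # Mean similarities from the worked example (first to fourth similarity).
    means = {"ip": 0.75, "time": 0.25, "src": 0.75, "behavior": 0.75}
    print(weights_from_means(means))
```

The means sum to 2.5, so dividing each by 2.5 reproduces the weights 0.3, 0.1, 0.3 and 0.3 given in the example.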
Meanwhile, the weight coefficients are adjusted from time to time: if the mean similarities of recently found batch registered account groups change substantially, the weight coefficients are adjusted accordingly. In this application, a high similarity computed for certain features of the batch registered account groups found in the service (such as the IP feature, the registration source feature and the operation behavior feature, each with a similarity of 0.75) indicates that the group character of those features is currently pronounced. The influence of such a feature therefore needs to be reflected in the overall similarity, so it is given a higher weight. In this way the weight settings are tied to the currently observed group features and track external changes, and the weights can be adjusted promptly once the groups change the way they operate.
In addition, the fifth similarity obtained from formula five is used as the final similarity between accounts, so that similarity is assessed jointly across four aspects: network protocol address, registration time, registration source and operation behavior. Compared with the prior art, in which similarity between accounts is judged only by the identifiers of the devices the accounts belong to, this improves the accuracy of identifying batch registered accounts: even if black-market operators tamper with the device identifiers, the accounts that belong to a batch registration can still be identified accurately.
According to the method and device for identifying batch registered account groups, the network protocol address, registration time, registration source, operation behavior and operation time corresponding to the operation behavior of each account in a plurality of accounts are acquired. Next, the similarity between every two accounts in the plurality of accounts is determined. Density-based clustering is then performed on the accounts according to the similarity between every two accounts. Finally, among the account groups obtained by clustering, those whose number of accounts is greater than the batch registration threshold are determined as batch registered account groups. The process of determining the similarity between two accounts comprises: performing a network protocol address similarity calculation based on the network protocol addresses of the two accounts to obtain a first similarity; performing a registration time similarity calculation based on the registration times of the two accounts to obtain a second similarity; performing a registration source similarity calculation based on the registration sources of the two accounts to obtain a third similarity; performing an operation behavior similarity calculation based on the operation behaviors of the two accounts and the corresponding operation times to obtain a fourth similarity; and determining the similarity between the two accounts based on the first, second, third and fourth similarities and the weight coefficient corresponding to each. This process analyzes the synchronism among accounts, so that even if black-market operators tamper with device identifiers, the accounts that belong to a batch registration can still be identified accurately, which improves the identification accuracy for batch registered accounts.
The similarity determination process of steps 201 to 205 is described below with reference to an example.
The first account u: the registration time is 14:25 on April 10, 2019, the registration source is ios, the registration IP is 110.18.82.28, and a subsequent login occurred at 15:10 on April 10, 2019.
The second account v: the registration time is 14:45 on April 10, 2019, the registration source is ios, the registration IP is 110.18.82.30, and subsequent logins occurred at 15:20 and 20:20 on April 10, 2019.
The similarity of u and v is calculated as follows:
1) calculating the similarity between the network protocol addresses:
ip2long(110.18.82.28)=1846694428
ip2long(110.18.82.30)=1846694430
which gives ip-sim(u,v)=0.999, the value used in step 5).
2) calculating the similarity between the registration times:
since the registration times of the two accounts are 20 minutes apart:
$\text{time-sim}(u,v)=e^{-0.01\times 20}=0.819$
3) calculating the similarity between the registration sources:
since the registration sources of the two accounts are the same:
src-sim(u,v)=1
4) calculating the similarity between the operation behaviors:
account u has only one login, at 15:10 on April 10, 2019; the closest login of account v is at 15:20 on April 10, 2019, an interval of 10 minutes. The time difference sequence is therefore Δt = {10}, so that:
$\text{behavior-sim}(u,v)=2^{-1}\cdot e^{-0.01\times 10}=0.452$
5) calculating the final comprehensive similarity of u and v:
taking the four weight coefficients as 0.3, 0.1, 0.3 and 0.3, respectively:
sim(u,v)=0.3*0.999+0.1*1+0.3*0.819+0.3*0.452=0.781
The similarity between account u and account v is therefore 0.781.
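The arithmetic of the worked example can be checked with a short Python sketch. Several points are assumptions rather than the patent's own definitions: the decay constant 0.01 per minute is taken from the example only; the ip-sim value 0.999 is taken as given because the corresponding formula is not reproduced in this text; the behavior-sim line reproduces only the single-element case shown in step 4); and all function names are hypothetical.

```python
import math
from datetime import datetime

DECAY_PER_MINUTE = 0.01  # constant seen in the worked example (assumption elsewhere)


def ip2long(ip):
    """Convert a dotted IPv4 address to its 32-bit integer value."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def minutes_between(t1, t2):
    """Absolute gap between two datetimes, in minutes."""
    return abs((t1 - t2).total_seconds()) / 60.0


def time_sim(t_u, t_v):
    """Registration-time similarity as used in step 2) of the example."""
    return math.exp(-DECAY_PER_MINUTE * minutes_between(t_u, t_v))


def src_sim(src_u, src_v):
    """Indicator function: 1 if the registration sources match, else 0."""
    return 1.0 if src_u == src_v else 0.0


def nearest_time_diffs(times_u, times_v):
    """For each operation of u, the gap in minutes to v's closest same-type operation."""
    return [min(minutes_between(t, other) for other in times_v) for t in times_u]


def combined_sim(ip_s, src_s, time_s, beh_s, w=(0.3, 0.1, 0.3, 0.3)):
    """Weighted sum of the four similarities (formula five)."""
    return w[0] * ip_s + w[1] * src_s + w[2] * time_s + w[3] * beh_s


reg_u = datetime(2019, 4, 10, 14, 25)
reg_v = datetime(2019, 4, 10, 14, 45)
logins_u = [datetime(2019, 4, 10, 15, 10)]
logins_v = [datetime(2019, 4, 10, 15, 20), datetime(2019, 4, 10, 20, 20)]

delta_t = nearest_time_diffs(logins_u, logins_v)               # [10.0]
# Single-element case from step 4) only; the general formula is not shown here.
beh = 2 ** -1 * math.exp(-DECAY_PER_MINUTE * delta_t[0])
total = combined_sim(0.999, src_sim("ios", "ios"), time_sim(reg_u, reg_v), beh)

print(ip2long("110.18.82.28"), ip2long("110.18.82.30"))  # 1846694428 1846694430
print(round(time_sim(reg_u, reg_v), 3), round(beh, 3), round(total, 3))  # 0.819 0.452 0.781
```

The printed values agree with steps 1) to 5): the two integer addresses differ by 2, and the weighted sum rounds to 0.781.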
For step 103, once the similarity between every two accounts has been determined, all accounts are clustered according to these similarities. Specifically, density-based clustering may be used, for example the DBSCAN algorithm. Because DBSCAN belongs to the prior art, and the clustering methods applicable to the invention are not limited to DBSCAN, its implementation is not described in detail here.
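For readers unfamiliar with density-based clustering, the following is a minimal DBSCAN-style sketch in plain Python that works directly on a pairwise similarity matrix. It is an illustration, not the patent's implementation: treating two accounts as neighbours when their similarity is at least `min_sim`, the values of `min_sim` and `min_pts`, and the function name are all assumptions, since the text does not specify how DBSCAN is parameterised.

```python
def dbscan_from_similarity(sim, min_sim=0.7, min_pts=3):
    """Return a cluster id (0, 1, ...) per account, or -1 for noise.

    sim is a symmetric matrix of pairwise similarities in [0, 1] with 1.0 on
    the diagonal. Two accounts are neighbours when sim >= min_sim (assumption).
    """
    n = len(sim)
    unvisited, noise = None, -1
    labels = [unvisited] * n

    def neighbours(i):
        # Includes i itself, as in the usual DBSCAN neighbourhood definition.
        return [j for j in range(n) if sim[i][j] >= min_sim]

    cluster = -1
    for i in range(n):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:     # j is a core point: keep expanding
                queue.extend(more)
    return labels


# Five hypothetical accounts: 0, 1 and 2 are mutually similar, 3 and 4 are not.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.3],
    [0.2, 0.1, 0.1, 0.3, 1.0],
]
labels = dbscan_from_similarity(sim)
print(labels)  # [0, 0, 0, -1, -1]
```

Working on similarities rather than distances avoids having to choose a conversion such as distance = 1 - similarity, which would be a further assumption.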
For step 104, a batch registration threshold is preset. The threshold can be obtained from big data statistics and is used to measure the number of accounts in an account group. Once the batch registration threshold is set, the number of accounts in each account group obtained by clustering is compared with the threshold, and account groups whose number of accounts is greater than the threshold are determined as batch registered account groups. For example, suppose clustering yields a first, a second and a third account group containing 100, 50 and 200 accounts respectively, and the batch registration threshold is 90; then the first and third account groups are both determined as batch registered account groups.
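The threshold comparison can be sketched as follows. The representation of the clustering result as a list of integer labels, with -1 marking accounts that belong to no group, is an assumption made for illustration, and `batch_registered_groups` is a hypothetical name.

```python
from collections import Counter


def batch_registered_groups(labels, threshold):
    """Return {group id: size} for groups strictly larger than the threshold.

    Assumption: labels holds one cluster id per account, with -1 for accounts
    that were not assigned to any group.
    """
    sizes = Counter(label for label in labels if label != -1)
    return {label: size for label, size in sizes.items() if size > threshold}


# Group sizes from the example: 100, 50 and 200 accounts, threshold 90.
labels = [0] * 100 + [1] * 50 + [2] * 200
flagged = batch_registered_groups(labels, threshold=90)
print(flagged)  # {0: 100, 2: 200}
```

The comparison is strict, following the text's "greater than the batch registration threshold", so a group of exactly 90 accounts would not be flagged.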
Based on the same inventive concept, a second embodiment of the present invention provides an apparatus for identifying batch registered account groups. As shown in fig. 3, the apparatus includes:
an obtaining module 301, configured to obtain a network protocol address, a registration time, a registration source, an operation behavior, and an operation time corresponding to the operation behavior of each account in a plurality of accounts;
a first determining module 302, configured to determine the similarity between every two accounts in the plurality of accounts;
a clustering module 303, configured to perform density-based clustering on the plurality of accounts according to the similarity between every two accounts;
a second determining module 304, configured to determine, as batch registered account groups, the account groups whose number of accounts is greater than a batch registration threshold among the account groups obtained by clustering;
wherein the first determining module comprises:
a first obtaining unit, configured to perform a network protocol address similarity calculation based on the network protocol addresses of the two accounts to obtain a first similarity;
a second obtaining unit, configured to perform a registration time similarity calculation based on the registration times of the two accounts to obtain a second similarity;
a third obtaining unit, configured to perform a registration source similarity calculation based on the registration sources of the two accounts to obtain a third similarity;
a fourth obtaining unit, configured to perform an operation behavior similarity calculation based on the operation behaviors of the two accounts and the operation times corresponding to the operation behaviors to obtain a fourth similarity;
and a determining unit, configured to determine the similarity between the two accounts based on the first similarity, the second similarity, the third similarity, the fourth similarity and the weight coefficient corresponding to each similarity.
Preferably, the first obtaining unit includes the following formula:
wherein ip-sim(u,v) is the first similarity, $IP_u$ is the network protocol address of the first of the two accounts, and $IP_v$ is the network protocol address of the second of the two accounts.
Preferably, the second obtaining unit includes the following formula:
wherein time-sim(u,v) is the second similarity, $t_u$ is the registration time of the first of the two accounts, and $t_v$ is the registration time of the second of the two accounts.
Preferably, the third obtaining unit includes the following formula:
$\text{src-sim}(u,v)=I(src_u=src_v)$
wherein src-sim(u,v) is the third similarity, $src_u$ is the registration source of the first of the two accounts, $src_v$ is the registration source of the second of the two accounts, and $I$ is an indicator function: $I(src_u=src_v)$ takes 1 if $src_u=src_v$ and 0 otherwise.
Preferably, the fourth obtaining unit includes the following formula:
wherein behavior-sim(u,v) is the fourth similarity, $\Delta t_i$ is an element of the operation time difference sequence, the sequence comprising the minimum operation time differences of the two accounts under the same operation behavior, and $s$ is the total length of the operation time difference sequence.
Based on the same inventive concept, the third embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method steps of the previous embodiments.
Based on the same inventive concept, the fourth embodiment of the present invention further provides a computer device, as shown in fig. 4. For convenience of description, only the parts related to the embodiment of the present invention are shown; for specific technical details that are not disclosed, please refer to the method part of the embodiments of the present invention. The computer device may be any terminal device, including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, a vehicle-mounted computer, etc. The following takes a mobile phone as an example of the computer device:
fig. 4 is a block diagram illustrating a partial structure associated with a computer device provided by an embodiment of the present invention. Referring to fig. 4, the computer apparatus includes: a memory 401 and a processor 402. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 4 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
The following describes the components of the computer device in detail with reference to fig. 4:
the memory 401 may be used to store software programs and modules, and the processor 402 executes various functional applications and data processing by operating the software programs and modules stored in the memory 401. The memory 401 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.), and the like. Further, the memory 401 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 402 is the control center of the computer device, and performs various functions and processes data by running or executing the software programs and/or modules stored in the memory 401 and calling the data stored in the memory 401. Optionally, the processor 402 may include one or more processing units; preferably, the processor 402 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications.
In the embodiment of the present invention, the processor 402 included in the computer device may have the functions corresponding to the method steps in any of the foregoing embodiments.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in accordance with embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
Claims (9)
1. A method for identifying a batch registered account group, the method comprising:
acquiring a network protocol address, registration time, a registration source, an operation behavior and operation time corresponding to the operation behavior of each account in a plurality of accounts;
determining similarity between every two account numbers in the plurality of account numbers;
performing density-based clustering on the plurality of account numbers according to the similarity between every two account numbers;
determining account groups of which the number of accounts is larger than a batch registration threshold value in the account groups obtained after clustering as batch registration account groups;
determining the similarity between the two account numbers comprises the following steps:
performing network protocol address similarity calculation based on the network protocol addresses of the two accounts to obtain a first similarity;
calculating the similarity of the registration time based on the registration time of the two accounts to obtain a second similarity;
performing registration source similarity calculation based on the registration sources of the two account numbers to obtain a third similarity;
calculating the similarity of the operation behaviors based on the operation behaviors of the two accounts and the operation time corresponding to the operation behaviors to obtain a fourth similarity, wherein the fourth similarity comprises the following formula:
wherein behavior-sim(u,v) is the fourth similarity, $\Delta t_i$ is an element of the operation time difference sequence, the sequence comprising the minimum operation time differences of the two accounts under the same operation behavior, and $s$ is the total length of the operation time difference sequence;
and determining the similarity between the two account numbers based on the first similarity, the second similarity, the third similarity, the fourth similarity and the weight coefficients corresponding to the similarities.
2. The method of claim 1, wherein the performing a network protocol address similarity calculation based on the network protocol addresses of the two accounts to obtain a first similarity comprises the following formula:
wherein ip-sim(u,v) is the first similarity, $IP_u$ is the network protocol address of the first of the two accounts, and $IP_v$ is the network protocol address of the second of the two accounts.
3. The method of claim 1, wherein the calculating of the similarity of the registration time based on the registration time of the two accounts to obtain the second similarity comprises the following formula:
wherein time-sim(u,v) is the second similarity, $t_u$ is the registration time of the first of the two accounts, and $t_v$ is the registration time of the second of the two accounts.
4. The method of claim 1, wherein the calculating of the similarity of the registered sources based on the registered sources of the two accounts to obtain the third similarity comprises the following formula:
$\text{src-sim}(u,v)=I(src_u=src_v)$
wherein src-sim(u,v) is the third similarity, $src_u$ is the registration source of the first of the two accounts, $src_v$ is the registration source of the second of the two accounts, and $I$ is an indicator function: $I(src_u=src_v)$ takes 1 if $src_u=src_v$ and 0 otherwise.
5. An apparatus for identifying a batch registered account group, the apparatus comprising:
the acquisition module is used for acquiring a network protocol address, registration time, a registration source, an operation behavior and operation time corresponding to the operation behavior of each account in a plurality of accounts;
the first determination module is used for determining the similarity between every two account numbers in the plurality of account numbers;
the clustering module is used for carrying out density-based clustering on the plurality of account numbers according to the similarity between every two account numbers;
the second determining module is used for determining account groups with the account number larger than the batch registration threshold value in the clustered account groups as batch registration account groups;
wherein the first determining module comprises:
a first obtaining unit, configured to perform network protocol address similarity calculation based on network protocol addresses of the two accounts to obtain a first similarity;
the second obtaining unit is used for calculating the similarity of the registration time based on the registration time of the two accounts to obtain a second similarity;
a third obtaining unit, configured to perform registration source similarity calculation based on the registration sources of the two account numbers to obtain a third similarity;
a fourth obtaining unit, configured to perform operation behavior similarity calculation based on the operation behaviors of the two accounts and the operation time corresponding to the operation behavior, and obtain a fourth similarity, where the fourth similarity includes the following formula:
wherein behavior-sim(u,v) is the fourth similarity, $\Delta t_i$ is an element of the operation time difference sequence, the sequence comprising the minimum operation time differences of the two accounts under the same operation behavior, and $s$ is the total length of the operation time difference sequence;
and the determining unit is used for determining the similarity between the two account numbers based on the first similarity, the second similarity, the third similarity, the fourth similarity and the weight coefficients corresponding to the similarities.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 4.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method steps of any one of claims 1 to 4 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910622891.7A CN110324352B (en) | 2019-07-11 | 2019-07-11 | Method and device for identifying batch registered account groups |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110324352A CN110324352A (en) | 2019-10-11 |
CN110324352B true CN110324352B (en) | 2021-10-15 |
Family
ID=68121827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910622891.7A Active CN110324352B (en) | 2019-07-11 | 2019-07-11 | Method and device for identifying batch registered account groups |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110324352B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||