CN112861128A - Method and system for identifying machine accounts in batches - Google Patents
- Publication number
- CN112861128A (CN202110083543.4A)
- Authority
- CN
- China
- Prior art keywords
- account
- key
- behavior
- user
- goodness
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2141—Access rights, e.g. capability lists, access control lists, access tables, access matrices
Abstract
An embodiment of the invention provides a method and a system for identifying machine accounts in batches. A computing engine, Spark, periodically acquires from a database the user behavior log of logged-in accounts for the previous period and extracts all accounts whose number of behaviors in that period exceeds a preset threshold; acquires the occurrence times of all key behaviors of each account in the previous period and forms an elastic data set for each account; sequentially counts the number of key behaviors occurring in each time period of a preset length; fits, with a linear regression equation, how the number of key behaviors of each account per time period changes over time, obtaining a linear regression fitting curve for the account; calculates the goodness of fit of each account's key behavior data from its fitting curve; and judges in batches, from that goodness of fit, whether each account is a machine account. Because the key behaviors of accounts are searched with Spark, the rate at which non-machine accounts are mistakenly flagged (the accidental injury rate) is reduced.
Description
Technical Field
The invention relates to the field of computers, in particular to a method and a system for identifying machine accounts in batches.
Background
On modern internet social media platforms, many bad actors use scripts to log in to accounts in batches and perform illegitimate operations such as artificially inflating traffic or engagement counts. These accounts generally have no substantial content; they degrade the experience of normal users and challenge the fairness of the platform. The machine accounts that are logged in to in batches by script therefore need to be found in batches.
In the process of implementing the invention, the applicant found at least the following problem in the prior art: the prior art generally counts the daily visits of each user, ranks users by visit count from high to low, and treats the top 5 percent of users as machine accounts. Although this finds some machine accounts, the rate at which legitimate accounts are mistakenly flagged (the accidental injury rate) is high, especially for top-ranked (head) accounts, which is unacceptable for normal users.
Disclosure of Invention
The embodiment of the invention provides a method and a system for identifying machine account numbers in batches.
To achieve the above object, in one aspect, an embodiment of the present invention provides a method for batch identifying machine accounts, including:
a computing engine Spark periodically acquires a user behavior log of login accounts in the previous period from a database, extracts all accounts of which the behavior quantity exceeds a preset quantity threshold value in the previous period, and forms a user account set;
acquiring the occurrence time of all key behaviors of each account in a user account set in a previous period from a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account;
for any account in the user account set, sequentially counting, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the permission scope of the account and that reach a preset importance level;
aiming at any account in the user account set, fitting the time-varying relation of the key behavior quantity of the account in each length time period by adopting a linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account; calculating the goodness of fit of key behavior data corresponding to each account according to a linear regression fitting curve of each account in the user account set;
judging whether each account is a machine account in batch according to the goodness of fit of key behavior data of each account in the user account set; the machine account is an account that is registered in batch by using a script to perform an illegal operation.
In another aspect, an embodiment of the present invention provides a system for identifying machine accounts in batches, including a database and a compute engine Spark, where the compute engine Spark includes: the device comprises a key behavior data integration unit, a linear regression unit and a judgment unit, wherein:
the database is used for storing a user behavior log of the login account;
the key behavior data integration unit is used for periodically acquiring a user behavior log of the login account in the previous period from the database, extracting all accounts of which the behavior quantity exceeds a preset quantity threshold value in the previous period, and forming a user account set; acquiring the occurrence time of all key behaviors of each account in a user account set in a previous period from a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account;
for any account in the user account set, sequentially counting, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the permission scope of the account and that reach a preset importance level;
the linear regression unit is used for fitting the time-varying relation of the key behavior quantity of the account in each length time period by adopting a linear regression equation aiming at any account in the user account set to obtain a linear regression fitting curve of the key behavior data of the account; calculating the goodness of fit of key behavior data corresponding to each account according to a linear regression fitting curve of each account in the user account set;
the judging unit is used for judging whether each account is a machine account in batch according to the goodness of fit of key behavior data of each account in the user account set; the machine account is an account that is registered in batch by using a script to perform an illegal operation.
The technical scheme has the following beneficial effects: the key behaviors of accounts are searched based on Spark; a fitting curve is obtained by linear regression on the change in the key behavior data, and its goodness of fit is calculated, so that machine accounts can be identified in batches. Screening on key behaviors allows low-frequency machine accounts to be found, which raises the discovery rate for low-frequency machine accounts while lowering the accidental injury rate (mistaken flagging) of non-machine accounts, and Spark allows the batch identification of machine accounts to run automatically.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for batch identification of machine accounts according to an embodiment of the present invention;
FIG. 2 is a structural diagram of a system for batch identification of machine accounts according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in combination with the embodiment of the present invention, there is provided a method for batch identification of machine accounts, including:
s101: a computing engine Spark periodically acquires a user behavior log of login accounts in the previous period from a database, extracts all accounts of which the behavior quantity exceeds a preset quantity threshold value in the previous period, and forms a user account set;
acquiring the occurrence time of all key behaviors of each account in a user account set in a previous period from a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account;
for any account in the user account set, sequentially counting, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the permission scope of the account and that reach a preset importance level;
s102: aiming at any account in the user account set, fitting the time-varying relation of the key behavior quantity of the account in each length time period by adopting a linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account; calculating the goodness of fit of key behavior data corresponding to each account according to a linear regression fitting curve of each account in the user account set;
s103: judging whether each account is a machine account in batch according to the goodness of fit of key behavior data of each account in the user account set; the machine account is an account that is registered in batch by using a script to perform an illegal operation.
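The first part of step s101, extracting the accounts whose behavior count exceeds the preset threshold, can be sketched in plain Python as follows. This is a minimal illustration rather than the Spark job itself: the record layout (account_id, behavior_type, timestamp), the function name select_candidate_accounts, and the sample log are assumptions introduced here.

```python
from collections import Counter


def select_candidate_accounts(behavior_log, count_threshold):
    """Return the accounts whose number of behaviors exceeds the threshold.

    behavior_log is an iterable of (account_id, behavior_type, timestamp)
    records; this layout is a hypothetical one chosen for the sketch.
    """
    counts = Counter(account_id for account_id, _behavior, _ts in behavior_log)
    return {account for account, n in counts.items() if n > count_threshold}


# Hypothetical log for one period: u1 has three behaviors, u2 has one.
log = [
    ("u1", "like", 100),
    ("u1", "comment", 160),
    ("u1", "forward", 220),
    ("u2", "like", 130),
]
candidates = select_candidate_accounts(log, count_threshold=2)
print(candidates)  # {'u1'}
```

Only accounts in the returned set would go on to the key behavior extraction and regression steps.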
Preferably, in step 101, the obtaining, in the user behavior log of the database, the occurrence time of all key behaviors of each account in the user account set in the previous period, and forming the occurrence time of all key behaviors of each account into an elastic data set of each account specifically includes:
s1011: the computing engine Spark acquires the occurrence time of all key behaviors of each account in a user account set in a previous period in a user behavior log of a database, and forms an intermediate elastic data set comprising the account in which the key behavior occurs and the occurrence time of the key behavior aiming at each key behavior;
s1012: all intermediate elastic data sets of the same account are obtained through the groupByKey function of the computing engine Spark; the occurrence times of all key behaviors in the intermediate elastic data sets of the account form an array, and the account together with this array of occurrence times forms the elastic data set of the account.
Preferably, the method further comprises the following steps:
s1013: after the intermediate elastic data sets are formed, for each intermediate elastic data set, the mapToPair function of the computing engine Spark is used to subtract the starting time of the current period from the occurrence time of the key behavior, giving the relative time at which each key behavior occurred; the unit of each relative time is then converted to give the conversion time of each key behavior, yielding an optimized intermediate elastic data set. The optimized intermediate elastic data set is used as the input of the groupByKey function to form the elastic data set of each account. Because the unit of conversion time is coarser than the unit of relative time, the number of key behaviors of an account that fall within one unit of conversion time is generally larger than the number that fall within one unit of relative time.
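Steps s1011 to s1013 can be illustrated with a plain-Python stand-in for the Spark pair transformation and grouping. This is a sketch under stated assumptions: occurrence times are taken to be datetime values, the conversion unit is assumed to be one hour, and the names to_conversion_time and group_by_account are hypothetical helpers, not Spark functions.

```python
from collections import defaultdict
from datetime import datetime

SECONDS_PER_UNIT = 3600  # assumed conversion unit: one hour


def to_conversion_time(occurred_at, period_start, unit_seconds=SECONDS_PER_UNIT):
    """Subtract the period start, then express the offset in whole units."""
    relative_seconds = (occurred_at - period_start).total_seconds()
    return int(relative_seconds // unit_seconds)


def group_by_account(pairs):
    """Collect the values of each account into one list (a groupByKey analogue)."""
    grouped = defaultdict(list)
    for account_id, value in pairs:
        grouped[account_id].append(value)
    return dict(grouped)


period_start = datetime(2020, 10, 10, 0, 0, 0)
# Intermediate data: one (account, occurrence time) pair per key behavior.
intermediate = [
    ("u1", datetime(2020, 10, 10, 8, 8, 10)),
    ("u1", datetime(2020, 10, 10, 8, 9, 10)),
    ("u2", datetime(2020, 10, 10, 9, 30, 0)),
]
# Optimized intermediate data: (account, conversion time) pairs.
optimized = [(acct, to_conversion_time(ts, period_start)) for acct, ts in intermediate]
elastic = group_by_account(optimized)
print(elastic)  # {'u1': [8, 8], 'u2': [9]}
```

Converting to a coarse unit before grouping keeps each account's array small, since only the unit index of each key behavior is retained.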
Preferably, step 102 specifically includes:
s1021: aiming at any account, taking a preset length time period as an independent variable of a linear regression equation, and taking the key behavior quantity of the account as a dependent variable of the linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account;
s1022: calculating the mean square error of the dependent variable estimation value in the linear regression fitting curve of the account key behavior data, and calculating the actual variance of the account key behavior data according to the key behavior quantity of the account in each preset length time period; and taking the ratio of the mean square error to the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of each account.
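Steps s1021 and s1022 can be sketched with an ordinary least-squares fit written in plain Python. The sketch rests on an interpretation that should be treated as an assumption: the "mean square error of the dependent variable estimate" is read as the squared deviation of the fitted values from the mean, so that the ratio to the actual variance is the usual coefficient of determination. For a least-squares line with an intercept, this ratio equals 1 minus the residual sum of squares over the total sum of squares, and the code computes both forms. The function names and the sample counts are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def goodness_of_fit(x, y):
    """Return R^2 in two equivalent forms: explained/total and 1 - residual/total."""
    slope, intercept = fit_line(x, y)
    fitted = [slope * xi + intercept for xi in x]
    mean_y = sum(y) / len(y)
    explained = sum((fi - mean_y) ** 2 for fi in fitted)
    residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    total = sum((yi - mean_y) ** 2 for yi in y)
    if total == 0:
        # All counts identical: the ratio is 0/0 and has no defined value.
        raise ValueError("counts are constant; goodness of fit is undefined")
    return explained / total, 1 - residual / total


# Hypothetical key behavior counts for six consecutive time periods.
counts = [3, 5, 6, 9, 10, 13]
periods = list(range(len(counts)))  # independent variable: period index
ratio_form, residual_form = goodness_of_fit(periods, counts)
print(round(ratio_form, 4), round(residual_form, 4))
```

The two returned values agree up to floating-point rounding, so either form can serve as the goodness of fit in the comparison step.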
Preferably, step 103 specifically includes:
s1031: comparing the goodness of fit of key behavior data of each account in the user account set with a set goodness threshold in batch;
s1032: when the fitting goodness of the key behavior data of a certain account is greater than or equal to the goodness threshold, determining the account as a machine account; and when the goodness of fit of the key behavior data of a certain account is smaller than a goodness threshold, determining that the account is a non-machine account.
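The batch comparison in steps s1031 and s1032 amounts to one threshold test per account. A minimal sketch follows; the function name classify_accounts, the sample scores, and the threshold value are assumptions made for illustration.

```python
def classify_accounts(goodness_by_account, goodness_threshold):
    """Map each account to True (machine account) or False (non-machine account).

    An account is flagged when its goodness of fit is greater than or
    equal to the threshold, as in step s1032.
    """
    return {
        account_id: goodness >= goodness_threshold
        for account_id, goodness in goodness_by_account.items()
    }


# Hypothetical goodness-of-fit values for three accounts.
scores = {"u1": 0.99, "u2": 0.40, "u3": 0.98}
labels = classify_accounts(scores, goodness_threshold=0.98)
print(labels)  # {'u1': True, 'u2': False, 'u3': True}
```

An account whose goodness of fit equals the threshold exactly is treated as a machine account, matching the "greater than or equal to" wording of the step.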
As shown in fig. 2, in combination with the embodiment of the present invention, there is also provided a system for batch recognition of machine accounts, including a database and a compute engine Spark, where the compute engine Spark includes: a key behavior data integration unit 21, a linear regression unit 22, and a judgment unit 23, wherein:
the database is used for storing a user behavior log of the login account;
the key behavior data integration unit 21 is configured to periodically obtain a user behavior log of the login account in the previous period from the database, extract all accounts of which the behavior number exceeds a preset number threshold in the previous period, and form a user account set; acquiring the occurrence time of all key behaviors of each account in a user account set in a previous period from a user behavior log of a database, and forming an elastic data set of each account according to the occurrence time of all key behaviors of each account; sequentially counting the number of key behaviors occurring in each length time period according to a preset length time period according to an elastic data set of an account number aiming at any account number in a user account number set; the key behaviors refer to behaviors which are performed by a user in an account number authority range and reach a preset important level;
the linear regression unit 22 is configured to fit, by using a linear regression equation, a change relationship of the number of the key behaviors of the account in each length time period with time to obtain a linear regression fit curve of the key behavior data of the account, for any account in the user account set; calculating the goodness of fit of key behavior data corresponding to each account according to a linear regression fitting curve of each account in the user account set;
the judging unit 23 is configured to judge whether each account is a machine account in batch according to the goodness of fit of the key behavior data of each account in the user account set; the machine account is an account that is registered in batch by using a script to perform an illegal operation.
Preferably, the key behavior data integration unit 21 includes:
the intermediate elastic data set subunit 211 is configured to obtain, in a user behavior log of the database, occurrence times of all key behaviors of each account in the user account set in a previous period, and form, for each key behavior, an intermediate elastic data set including the account where the key behavior occurs and the occurrence time of the key behavior;
the key behavior data integration subunit 212 is configured to obtain all intermediate elastic data sets of the same account through the groupByKey function of the compute engine Spark, form an array from the occurrence times of all key behaviors in the intermediate elastic data sets of the account, and form the elastic data set of the account from the account and this array of occurrence times.
Preferably, the critical behavior data integration unit 21 further includes:
an intermediate elastic data set optimizing subunit 213, configured to, after the intermediate elastic data sets are formed, for each intermediate elastic data set, subtract the starting time of the current period from the occurrence time of the key behavior by using the mapToPair function of the compute engine Spark to obtain the relative time of each key behavior, and convert the unit of each relative time to obtain the conversion time of each key behavior, so as to obtain an optimized intermediate elastic data set;
the key behavior data integration subunit 212 is specifically configured to use the optimized intermediate elastic data set as the input of the groupByKey function to form the elastic data set of each account.
Preferably, the linear regression unit 22 includes:
the linear fitting subunit 221 is configured to, for any account, use a preset length time period as an independent variable of a linear regression equation, and use the number of key behaviors of the account as a dependent variable of the linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account;
a goodness-of-fit calculation subunit 222, configured to calculate the mean square error of the dependent variable estimate in the linear regression fitting curve of the account's key behavior data, and calculate the actual variance of the account's key behavior data from the number of key behaviors of the account in each time period of preset length; and take the ratio of the mean square error to the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of that account. Here, the time period of preset length may be one unit of conversion time.
Preferably, the judging unit 23 includes:
a comparing subunit 231, configured to compare, in batches, the goodness of fit of the key behavior data of each account in the user account set with a set goodness threshold;
a determining subunit 232, configured to determine that a certain account is a machine account when the goodness of fit of the key behavior data of the account is greater than or equal to a goodness threshold; and when the goodness of fit of the key behavior data of a certain account is smaller than a goodness threshold, determining that the account is a non-machine account.
The embodiment of the invention has the following beneficial effects:
The key behaviors of accounts are searched based on Spark; a fitting curve is obtained by linear regression on the change in the key behavior data, and its goodness of fit is calculated, so that machine accounts can be identified in batches. Screening on key behaviors allows low-frequency machine accounts to be found, which raises the discovery rate for low-frequency machine accounts while lowering the accidental injury rate (mistaken flagging) of non-machine accounts, and Spark allows the batch identification of machine accounts to run automatically.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
The technical terms involved in the invention are explained as follows:
Machine account: on modern internet social media platforms, many bad actors use scripts to log in to accounts in batches and perform illegitimate operations such as artificially inflating traffic or engagement counts. These accounts generally have no substantial content; they degrade the experience of normal users and challenge the fairness of the platform.
Behavior log: the log recorded when an internet account performs an uplink operation such as liking, commenting, or following. Each entry includes the operation behavior number, account, time, target, and other information.
The invention relates to a system and method for batch identification of machine accounts based on Spark and linear regression, which can automatically find, in batches and by means of data mining and analysis, the machine accounts that are logged in to in batches by script. The method and system can find machine accounts with low-frequency access, achieve a very high discovery rate for such accounts, and reduce the accidental injury rate of the system as a whole.
The invention relates to a machine account number batch identification system and method based on Spark and linear regression, which adopts the complete technical scheme as follows:
1. Obtain the user set U (i.e., the user account set): all users whose number of behaviors (likes, comments, forwards) on the previous day exceeds C.
2. Using Spark's hive query, query the times of yesterday's key behaviors for every uid in U, and form the timestamps of the key behaviors into an intermediate elastic data set RDD1 with the format [uid, t]. Here Spark is a computing engine set up for the distributed cluster, and hive is the database.
3. Using Spark's mapToPair function, subtract the timestamp of 00:00 yesterday from t, divide the result by t0 (3600 s is appropriate), and round to an integer, forming the optimized intermediate elastic data set RDD2 with the format [uid, h]. Because the unit of conversion time is coarser than the unit of relative time, the number of key behaviors of an account that fall within one unit of conversion time is generally larger than the number that fall within one unit of relative time.
4. Using Spark's groupByKey function, group the h values of the same uid together to form the elastic data set RDD3 of the account, with the format [uid, [h0, h1, …]].
5. For any account in the user account set, sequentially count, from the elastic data set of the account, the number of key behaviors occurring in each time period of the preset length, namely: take RDD3 out of Spark with Spark's collect function to form an array L, and for each element of L count the total number of behaviors in each interval of length T0, i.e., the user's total T[0] for the interval from 0 to T0, the total T[1] for T0 to 2T0, the total T[2] for 2T0 to 3T0, and so on, forming a list T.
6. For any account in the user account set, fit the change over time of the account's key behavior count in each time period with a linear regression equation to obtain a linear regression fitting curve of the account's key behavior data; then calculate the goodness of fit of the key behavior data of each account from its fitting curve.
A goodness-of-fit test is then performed: if the values T[0], T[1], T[2], … are almost constant, with little change, the goodness of fit R² of the linear regression will be high.
7. Define a threshold R0; if R² > R0, the account is considered a machine account.
Specific examples are as follows:
For all users with more than 1000 behaviors on the previous day, the key behavior records are queried in hive, for example [1:20201010080810, 1:20201010080910, …], indicating that user number 1 initiated key behaviors at 20201010080810 and 20201010080910.
Steps 2 and 3 then convert each timestamp to the hour in which the behavior occurred, i.e., [1:8, 1:8, …];
then, in step 4, the records of each uid are aggregated to obtain data of the form [uid: list of hours in which the key behaviors occurred], namely [1: [8, 8, 9, 9, 10, 10, 11, 11], 2: [9, 10, 18, 18, 18], …].
For one of the users, assume the behavior list is [0, 0, 1, 1, 2, 2, 3, 3, …, 23]; counting the number of behaviors in every interval T0 gives T (if T0 is one hour, the length of T is 24):
[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1]
It can be seen that this account behaves uniformly at all times of day, much like a machine account: the line graph of T for a machine account is very smooth, resembling a straight line. Normal users, by contrast, generally do not visit at night and rest for part of the day, so their T fluctuates significantly. That is, if linear regression is used for fitting, the fit for a machine account will be very good; in other words, when a linear regression is fitted to the T sequence, the better the fit, the more likely the account is abnormal.
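The per-period counting that produces T from a user's behavior list can be sketched as follows. The full behavior list is reconstructed here from the counts shown in T (each hour from 0 to 22 twice, hour 23 once), which is an assumption about the elided entries; the function name count_per_period is hypothetical.

```python
from collections import Counter


def count_per_period(period_indices, num_periods):
    """Count how many behaviors fall in each period, including empty periods."""
    tally = Counter(period_indices)
    return [tally[i] for i in range(num_periods)]


# Assumed behavior list: hours 0..22 appear twice each, hour 23 once.
hours = [h for h in range(23) for _ in range(2)] + [23]
T = count_per_period(hours, num_periods=24)
print(T)
```

Periods with no behaviors are kept as zeros so that T always has one entry per period, which matters for users who are inactive for part of the day.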
R² is calculated as follows.
Many software packages can perform the optimization fit; here the curve_fit method of Python's scipy package is used.
f(x) is defined as the straight line y = ax + b, and x = [0, 1, 2, 3, …] is defined with the same length as T; then:
import numpy
from scipy.optimize import curve_fit
def f(x, a, b):
    return a * x + b
x = numpy.arange(len(T))
popt, pcov = curve_fit(f, x, T)
After this statement executes, popt holds the optimized a and b.
Calculation of the goodness of fit R²:
yvals = f(x, *popt)  # fitted values on the regression line
sum0 = 0
sum1 = 0
average = numpy.average(T)
for i in range(len(yvals)):
    sum0 += (T[i] - yvals[i]) ** 2  # residual sum of squares
    sum1 += (T[i] - average) ** 2  # total sum of squares
R2 = 1 - (sum0 / sum1)
For this user's T, the result is R² of about 0.9995. With R0 = 0.98, R² > R0, so the user is judged to be a machine user.
Now consider T for a normal user:
[1,0,0,0,0,0,0,0,1,0,2,1,10,10,0,0,0,4,0,19,20,40,40,20];
R² is about 0.2, so R² < R0;
the user is therefore determined to be a normal user.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium; thus, if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, that connection is included in the definition of medium. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A method for batch identification of machine accounts, comprising:
a computing engine Spark periodically acquires, from a database, a user behavior log of logged-in accounts for the previous period, extracts all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and forms a user account set;
acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and forming an elastic data set for each account from the occurrence times of all key behaviors of that account;
for any account in the user account set, sequentially counting, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the permissions of the account and that reach a preset importance level;
for any account in the user account set, fitting the relationship between time and the number of key behaviors of the account in each time period of the preset length with a linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account; calculating the goodness of fit of the key behavior data of each account from the linear regression fitting curve of each account in the user account set;
judging in batch whether each account is a machine account according to the goodness of fit of the key behavior data of each account in the user account set; a machine account is an account registered in batch by a script in order to perform illegal operations.
2. The method for batch identification of machine accounts according to claim 1, wherein the step of obtaining occurrence times of all key behaviors of each account in a previous cycle in a user behavior log of a database and forming the occurrence times of all key behaviors of each account into an elastic data set of each account specifically comprises:
the computing engine Spark acquires the occurrence time of all key behaviors of each account in a user account set in a previous period in a user behavior log of a database, and forms an intermediate elastic data set comprising the account in which the key behavior occurs and the occurrence time of the key behavior aiming at each key behavior;
all intermediate elastic data sets of the same account are obtained through the groupByKey function of the computing engine Spark, the occurrence times of all key behaviors in the intermediate elastic data sets of the account are formed into an array, and the account together with the array of occurrence times of all its key behaviors forms the elastic data set of the account.
3. The method for batch identification of machine accounts according to claim 2, further comprising:
after the intermediate elastic data sets are formed, for each intermediate elastic data set, subtracting the start time of the current period from the occurrence time of the key behavior by using the mapToPair function of the computing engine Spark to obtain the relative time of occurrence of each key behavior, and converting the unit of each relative time to obtain the converted time of occurrence of each key behavior, thereby obtaining an optimized intermediate elastic data set; the optimized intermediate elastic data set is used as the object on which the groupByKey function operates to form the elastic data set of each account.
4. The method for batch identification of machine accounts according to claim 2, wherein the fitting of the time-dependent variation relationship of the number of the key behaviors of the account in each length time period by using a linear regression equation for any account in the user account set to obtain a linear regression fitting curve of the key behavior data of the account specifically comprises:
for any account, taking the time periods of the preset length as the independent variable of a linear regression equation and the number of key behaviors of the account as the dependent variable of the linear regression equation, to obtain a linear regression fitting curve of the key behavior data of the account;
the calculating of the goodness of fit of the key behavior data corresponding to each account according to the linear regression fitting curve of each account in the user account set specifically includes:
calculating the mean square error of the dependent variable estimation value in the linear regression fitting curve of the account key behavior data, and calculating the actual variance of the account key behavior data according to the key behavior quantity of the account in each preset length time period; and taking the ratio of the mean square error to the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of each account.
5. The method for batch identification of machine accounts according to claim 4, wherein the batch judgment of whether each account is a machine account according to the goodness of fit of the key behavior data of each account specifically comprises:
comparing the goodness of fit of key behavior data of each account in the user account set with a set goodness threshold in batch;
when the fitting goodness of the key behavior data of a certain account is greater than or equal to the goodness threshold, determining the account as a machine account;
and when the goodness of fit of the key behavior data of a certain account is smaller than a goodness threshold, determining that the account is a non-machine account.
6. A system for batch identification of machine accounts, comprising a database and a computing engine Spark, wherein the computing engine Spark comprises a key behavior data integration unit, a linear regression unit and a judging unit, wherein:
the database is used for storing a user behavior log of the login account;
the key behavior data integration unit is used for periodically acquiring, from the database, a user behavior log of logged-in accounts for the previous period, extracting all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and forming a user account set; acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and forming an elastic data set for each account from the occurrence times of all key behaviors of that account; and, for any account in the user account set, sequentially counting, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the permissions of the account and that reach a preset importance level;
the linear regression unit is used for, for any account in the user account set, fitting the relationship between time and the number of key behaviors of the account in each time period of the preset length with a linear regression equation to obtain a linear regression fitting curve of the key behavior data of the account; and calculating the goodness of fit of the key behavior data of each account from the linear regression fitting curve of each account in the user account set;
the judging unit is used for judging in batch whether each account is a machine account according to the goodness of fit of the key behavior data of each account in the user account set; a machine account is an account registered in batch by a script in order to perform illegal operations.
7. The system for batch identification of machine accounts of claim 6, wherein the key behavior data integration unit comprises:
the intermediate elastic data set subunit is used for acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and forming, for each key behavior, an intermediate elastic data set comprising the account in which the key behavior occurred and the occurrence time of the key behavior;
and the key behavior data integration subunit is used for obtaining all intermediate elastic data sets of the same account through the groupByKey function of the computing engine Spark, forming the occurrence times of all key behaviors in the intermediate elastic data sets of the account into an array, and forming the elastic data set of the account from the account together with the array of occurrence times of all its key behaviors.
8. The system for batch identification of machine accounts according to claim 7, wherein the key behavior data integration unit further comprises:
the intermediate elastic data set optimizing subunit is used for, after the intermediate elastic data sets are formed, subtracting the start time of the current period from the occurrence time of the key behavior by using the mapToPair function of the computing engine Spark to obtain the relative time of occurrence of each key behavior, and converting the unit of each relative time to obtain the converted time of occurrence of each key behavior, thereby obtaining the optimized intermediate elastic data sets;
the key behavior data integration subunit is specifically configured to use the optimized intermediate elastic data set as an object obtained by the groupByKey function to form an elastic data set of each account.
9. The system for batch identification of machine accounts of claim 7, wherein the linear regression unit comprises:
the linear fitting subunit is used for, for any account, taking the time periods of the preset length as the independent variable of a linear regression equation and the number of key behaviors of the account as the dependent variable of the linear regression equation, to obtain a linear regression fitting curve of the key behavior data of the account;
a goodness-of-fit calculation subunit, configured to calculate a mean square error of a dependent variable estimated value in a linear regression fitting curve of the account key behavior data, and calculate an actual variance of the account key behavior data according to the number of key behaviors of the account in each preset length period; and taking the ratio of the mean square error to the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of each account.
10. The system for batch identification of machine accounts according to claim 9, wherein the determining unit includes:
the comparison subunit is used for comparing the goodness of fit of the key behavior data of each account in the user account set with a set goodness threshold in batch;
the judging subunit is used for judging that the account is a machine account when the goodness of fit of the key behavior data of the account is greater than or equal to a goodness threshold; and when the goodness of fit of the key behavior data of a certain account is smaller than a goodness threshold, determining that the account is a non-machine account.
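The data-preparation steps recited in claims 1 to 3 can be illustrated with a small single-machine sketch. It is plain Python standing in for the Spark job, under assumptions that are not in the source: events arrive as (account, Unix timestamp in seconds) pairs, the period is one day split into 24 one-hour buckets, and the names build_hourly_counts, PERIOD_SECONDS, BUCKET_SECONDS, and min_behaviors are hypothetical. The per-account grouping plays the role of groupByKey, and subtracting the period start mirrors the mapToPair conversion to relative time. For simplicity the count threshold is applied to key behaviors only, whereas claim 1 applies it to all behaviors in the log.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PERIOD_SECONDS = 24 * 3600  # assumed period length: one day
BUCKET_SECONDS = 3600       # assumed preset time-period length: one hour


def build_hourly_counts(
    events: Iterable[Tuple[str, int]],
    period_start: int,
    min_behaviors: int = 0,
) -> Dict[str, List[int]]:
    """Return {account: T}, where T[k] is the number of key behaviors in bucket k."""
    # mapToPair analogue: absolute time -> relative time -> bucket index.
    # groupByKey analogue: collect the bucket indices of each account.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for account, ts in events:
        relative = ts - period_start
        if 0 <= relative < PERIOD_SECONDS:  # ignore events outside the period
            grouped[account].append(relative // BUCKET_SECONDS)

    n_buckets = PERIOD_SECONDS // BUCKET_SECONDS
    result: Dict[str, List[int]] = {}
    for account, buckets in grouped.items():
        if len(buckets) <= min_behaviors:
            continue  # keep only accounts whose count exceeds the threshold
        T = [0] * n_buckets
        for k in buckets:
            T[k] += 1
        result[account] = T
    return result


# Made-up events for illustration; "u2" has one event outside the period
# and is then dropped because only one event remains.
events = [("u1", 1000), ("u1", 4000), ("u1", 4100), ("u2", 90000), ("u2", 50)]
counts = build_hourly_counts(events, period_start=0, min_behaviors=1)
print(counts["u1"][:3])  # [1, 2, 0]
```

Each resulting T is the per-period count sequence to which the linear regression and goodness-of-fit steps of claims 4 and 5 would then be applied.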
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110083543.4A CN112861128A (en) | 2021-01-21 | 2021-01-21 | Method and system for identifying machine accounts in batches |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112861128A (en) | 2021-05-28 |
Family
ID=76008938
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202110083543.4A | Pending | CN112861128A (en) | 2021-01-21 | 2021-01-21 | Method and system for identifying machine accounts in batches |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112861128A (en) |
Citations (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102571484A (en) * | 2011-12-14 | 2012-07-11 | 上海交通大学 | Method for detecting and finding online water army |
CN103839197A (en) * | 2014-03-19 | 2014-06-04 | 国家电网公司 | Method for judging abnormal electricity consumption behaviors of users based on EEMD method |
JP2014160344A (en) * | 2013-02-19 | 2014-09-04 | Nippon Telegr & Teleph Corp <Ntt> | Bot determination device and method and program and numerical value aggregate distribution determination device |
JP2015141456A (en) * | 2014-01-27 | 2015-08-03 | Kddi株式会社 | bot determination device, bot determination method, and program |
CN106886915A (en) * | 2017-01-17 | 2017-06-23 | 华南理工大学 | A kind of ad click predictor method based on time decay sampling |
CN107305611A (en) * | 2016-04-22 | 2017-10-31 | 腾讯科技(深圳)有限公司 | The corresponding method for establishing model of malice account and device, the method and apparatus of malice account identification |
CN108109015A (en) * | 2017-12-29 | 2018-06-01 | 广州品唯软件有限公司 | A kind of marketing selective analysis method and device |
US20180234447A1 (en) * | 2015-08-07 | 2018-08-16 | Stc.Unm | System and methods for detecting bots real-time |
CN109359848A (en) * | 2018-10-09 | 2019-02-19 | 烟台海颐软件股份有限公司 | A kind of extremely relevant electricity consumer recognition methods of line loss and system |
JP2019054715A (en) * | 2017-09-15 | 2019-04-04 | 東京電力ホールディングス株式会社 | Power theft monitoring system, power theft monitoring device, power theft monitoring method and program |
CN109818921A (en) * | 2018-12-14 | 2019-05-28 | 微梦创科网络科技(中国)有限公司 | A kind of analysis method and device of the improper flow of website interface |
CN110288114A (en) * | 2019-03-22 | 2019-09-27 | 国网浙江省电力有限公司信息通信分公司 | Violation electricity consumption behavior prediction method based on power marketing data |
US20200084219A1 (en) * | 2018-09-06 | 2020-03-12 | International Business Machines Corporation | Suspicious activity detection in computer networks |
CN110988422A (en) * | 2019-12-19 | 2020-04-10 | 北京中电普华信息技术有限公司 | Electricity stealing identification method and device and electronic equipment |
CN111159399A (en) * | 2019-12-13 | 2020-05-15 | 天津大学 | Automobile vertical website water army discrimination method |
CN111275416A (en) * | 2020-01-15 | 2020-06-12 | 中国人民解放军国防科技大学 | Digital currency abnormal transaction detection method and device, electronic equipment and medium |
CN111368254A (en) * | 2020-03-02 | 2020-07-03 | 西安邮电大学 | Multi-view data missing completion method for multi-manifold regularization non-negative matrix factorization |
CN111507611A (en) * | 2020-04-15 | 2020-08-07 | 北京中电普华信息技术有限公司 | Method and system for determining electricity stealing suspected user |
CN111507377A (en) * | 2020-03-24 | 2020-08-07 | 微梦创科网络科技(中国)有限公司 | Number maintenance account number batch identification method and device |
CN111984695A (en) * | 2020-07-21 | 2020-11-24 | 微梦创科网络科技(中国)有限公司 | Method and system for determining black grouping based on Spark |
CN112000711A (en) * | 2020-07-21 | 2020-11-27 | 微梦创科网络科技(中国)有限公司 | Method and system for determining evaluation user based on Spark |
CN112084229A (en) * | 2020-07-27 | 2020-12-15 | 北京市燃气集团有限责任公司 | Method and device for identifying abnormal gas consumption behaviors of town gas users |
CN112115324A (en) * | 2020-08-10 | 2020-12-22 | 微梦创科网络科技(中国)有限公司 | Method and device for confirming praise-refreshing user based on power law distribution |
CN112149036A (en) * | 2020-09-28 | 2020-12-29 | 微梦创科网络科技(中国)有限公司 | Method and system for identifying batch abnormal interaction behaviors |
CN112148947A (en) * | 2020-09-28 | 2020-12-29 | 微梦创科网络科技(中国)有限公司 | Method and system for mining and reviewing users in batches |
Non-Patent Citations (5)
Title |
---|
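Liu Kan et al.: "Research on identifying machine users on Weibo based on random forest classification", Journal of Peking University (Natural Science Edition), pages 289 - 300 * |
Zhang Yanmei et al.: "Research on an algorithm for identifying online water armies on Weibo based on a Bayesian model", Journal on Communications, pages 44 - 53 * |
Lin Yongcheng: "Research and application of techniques for discriminating machine users in social networks", China Master's Theses Full-text Database (Information Science and Technology), pages 139 - 144 * |
Wang Junbo: "Identifying online water armies based on e-commerce reviews", China Master's Theses Full-text Database (Information Science and Technology), pages 138 - 823 * |
Xie Zhonghong et al.: "Identifying Weibo water armies based on the logistic regression algorithm", Microcomputer & Its Applications, pages 67 - 69 * |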
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||