CN112861128A - Method and system for identifying machine accounts in batches - Google Patents

Method and system for identifying machine accounts in batches

Publication number: CN112861128A
Authority: CN (China)
Prior art keywords: account, key, behavior, user, goodness
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202110083543.4A
Other languages: Chinese (zh)
Inventor: 王嘉伟
Current Assignee: Weimeng Chuangke Network Technology China Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Weimeng Chuangke Network Technology China Co Ltd
Application filed by: Weimeng Chuangke Network Technology China Co Ltd
Priority to: CN202110083543.4A
Publication: CN112861128A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2141 Access rights, e.g. capability lists, access control lists, access tables, access matrices

Abstract

The embodiment of the invention provides a method and a system for identifying machine accounts in batches. A computing engine, Spark, periodically acquires from a database the user behavior log of login accounts for the previous period and extracts all accounts whose number of behaviors in that period exceeds a preset quantity threshold; acquires the occurrence times of all key behaviors of each account in the previous period and forms an elastic data set for each account; counts, in sequence, the number of key behaviors occurring in each time period of a preset length; fits, for any account, the change over time of the number of key behaviors per time period with a linear regression equation to obtain a linear regression fitting curve for the account; calculates the goodness of fit of the key behavior data of each account from its linear regression fitting curve; and judges in batches, from the goodness of fit of each account's key behavior data, whether each account is a machine account. Because the key behaviors of accounts are searched with Spark, the false-positive (accidental injury) rate for non-machine accounts is reduced.

Description

Method and system for identifying machine accounts in batches
Technical Field
The invention relates to the field of computers, in particular to a method and a system for identifying machine accounts in batches.
Background
On modern internet social media platforms, a large number of bad actors use scripts to log in to accounts in batches and perform illegal operations such as artificially inflating counts. These accounts generally have no substantial content, so they harm normal users' experience and challenge the fairness of the platform. Machine accounts that are logged in in batches by scripts therefore need to be found in batches.
In the process of implementing the invention, the applicant found at least the following problem in the prior art: the prior art generally counts the daily visits of each user, ranks users by visits from high to low, and treats the top 5 percent of users as machine accounts. Although this finds some machine accounts, its false-positive (accidental injury) rate is high, especially for head (top-ranked) accounts, which is unacceptable for normal users.
Disclosure of Invention
The embodiment of the invention provides a method and a system for identifying machine account numbers in batches.
To achieve the above object, in one aspect, an embodiment of the present invention provides a method for batch identifying machine accounts, including:
a computing engine, Spark, periodically acquires from a database the user behavior log of login accounts for the previous period, extracts all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and forms a user account set;
acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and forming an elastic data set (a Spark resilient distributed dataset, RDD) for each account from the occurrence times of all of its key behaviors;
for any account in the user account set, counting in sequence, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the authority of the account and that reach a preset importance level;
for any account in the user account set, fitting the change over time of the account's number of key behaviors per time period with a linear regression equation to obtain a linear regression fitting curve of the account's key behavior data; calculating the goodness of fit of the key behavior data of each account from the linear regression fitting curve of each account in the user account set;
judging in batches whether each account is a machine account according to the goodness of fit of the key behavior data of each account in the user account set; a machine account is an account that is logged in in batches by a script in order to perform illegal operations.
In another aspect, an embodiment of the present invention provides a system for identifying machine accounts in batches, including a database and a computing engine Spark, where the computing engine Spark includes a key behavior data integration unit, a linear regression unit, and a judgment unit, wherein:
the database is used for storing the user behavior log of login accounts;
the key behavior data integration unit is used for periodically acquiring from the database the user behavior log of login accounts for the previous period, extracting all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and forming a user account set; acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and forming an elastic data set for each account from the occurrence times of all of its key behaviors;
and, for any account in the user account set, counting in sequence, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the authority of the account and that reach a preset importance level;
the linear regression unit is used for fitting, for any account in the user account set, the change over time of the account's number of key behaviors per time period with a linear regression equation to obtain a linear regression fitting curve of the account's key behavior data; and calculating the goodness of fit of the key behavior data of each account from the linear regression fitting curve of each account in the user account set;
the judgment unit is used for judging in batches whether each account is a machine account according to the goodness of fit of the key behavior data of each account in the user account set; a machine account is an account that is logged in in batches by a script in order to perform illegal operations.
The technical scheme has the following beneficial effects: the key behaviors of accounts are searched with Spark, the change in key behavior data is fitted by linear regression to obtain a fitting curve, and the goodness of fit of that curve is calculated, so that machine accounts can be identified in batches. Screening on key behaviors also finds low-frequency machine accounts, which improves the discovery rate of low-frequency machine accounts while reducing the false-positive (accidental injury) rate for non-machine accounts, and Spark allows the batch identification of machine accounts to run automatically.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for batch identification of machine accounts according to an embodiment of the present invention;
FIG. 2 is a structural diagram of a system for batch identification of machine accounts according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, in combination with the embodiment of the present invention, there is provided a method for batch identification of machine accounts, including:
S101: a computing engine, Spark, periodically acquires from a database the user behavior log of login accounts for the previous period, extracts all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and forms a user account set;
acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and forming an elastic data set for each account from the occurrence times of all of its key behaviors;
for any account in the user account set, counting in sequence, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the authority of the account and that reach a preset importance level;
S102: for any account in the user account set, fitting the change over time of the account's number of key behaviors per time period with a linear regression equation to obtain a linear regression fitting curve of the account's key behavior data; calculating the goodness of fit of the key behavior data of each account from the linear regression fitting curve of each account in the user account set;
S103: judging in batches whether each account is a machine account according to the goodness of fit of the key behavior data of each account in the user account set; a machine account is an account that is logged in in batches by a script in order to perform illegal operations.
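The first part of step S101 can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions: the log row layout, the `KEY_BEHAVIORS` set, and the function names are hypothetical, and a real deployment would run the equivalent query on Spark rather than on an in-memory list.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (account id, behavior type, epoch seconds).
LOGS = [
    (1, "like", 100), (1, "comment", 160), (1, "view", 170),
    (2, "like", 300),
]

# Assumption: which behavior types count as "key" is a platform choice.
KEY_BEHAVIORS = {"like", "comment", "forward"}


def active_accounts(logs, count_threshold):
    """Accounts whose total behavior count exceeds the threshold."""
    counts = Counter(uid for uid, _, _ in logs)
    return {uid for uid, n in counts.items() if n > count_threshold}


def key_behavior_times(logs, accounts):
    """Map each selected account to the sorted times of its key behaviors."""
    times = defaultdict(list)
    for uid, behavior, ts in logs:
        if uid in accounts and behavior in KEY_BEHAVIORS:
            times[uid].append(ts)
    return {uid: sorted(ts_list) for uid, ts_list in times.items()}


accounts = active_accounts(LOGS, count_threshold=2)
times_by_account = key_behavior_times(LOGS, accounts)
```

Filtering on the total behavior count first keeps the later per-account regression limited to accounts with enough activity to fit.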
Preferably, in step S101, acquiring from the user behavior log of the database the occurrence times of all key behaviors of each account in the user account set in the previous period, and forming an elastic data set for each account from those occurrence times, specifically includes:
S1011: the computing engine Spark acquires, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period and, for each key behavior, forms an intermediate elastic data set comprising the account in which the key behavior occurred and the occurrence time of the key behavior;
S1012: all intermediate elastic data sets of the same account are obtained through the groupByKey function of the computing engine Spark, the occurrence times of the key behaviors in those intermediate elastic data sets form an array, and the account together with the array of occurrence times of all of its key behaviors forms the elastic data set of the account.
Preferably, the method further comprises the following step:
S1013: after the intermediate elastic data sets are formed, for each intermediate elastic data set, using the mapToPair function of the computing engine Spark to subtract the start time of the current period from the occurrence time of the key behavior, giving the relative time at which each key behavior occurred, and converting the unit of each relative time to give the converted time at which each key behavior occurred, thereby obtaining an optimized intermediate elastic data set; the optimized intermediate elastic data sets are then the objects gathered by the groupByKey function to form the elastic data set of each account. Because the converted time unit is coarser than the unit of the relative time, the number of key behaviors of an account per unit of converted time is generally larger than the number per unit of relative time.
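Steps S1011 to S1013 can be illustrated without a cluster by plain-Python stand-ins for the two Spark operations. The sketch below is an assumption-laden illustration, not the patented implementation: `PERIOD_START`, `UNIT_SECONDS`, and all function and variable names are hypothetical, and truncating division is assumed for the unit conversion.

```python
from collections import defaultdict

PERIOD_START = 1_602_288_000  # hypothetical epoch second at which the period starts
UNIT_SECONDS = 3600           # assumed conversion unit: one hour


def to_converted_time(pair):
    """Stand-in for the mapToPair step: (uid, t) -> (uid, h)."""
    uid, t = pair
    return uid, (t - PERIOD_START) // UNIT_SECONDS


def group_by_key(pairs):
    """Stand-in for groupByKey: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Intermediate data: one (uid, occurrence time) pair per key behavior.
intermediate = [
    (1, PERIOD_START + 10),
    (2, PERIOD_START + 7200),
    (1, PERIOD_START + 3700),
]
optimized = [to_converted_time(p) for p in intermediate]
per_account = group_by_key(optimized)
```

Converting to a coarse unit before grouping keeps each account's array small, since many raw timestamps collapse onto the same converted value.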
Preferably, step S102 specifically includes:
S1021: for any account, taking the time period of the preset length as the independent variable of the linear regression equation and the account's number of key behaviors as the dependent variable, to obtain a linear regression fitting curve of the account's key behavior data;
S1022: calculating the mean square error of the estimated values of the dependent variable on the linear regression fitting curve of the account's key behavior data, and calculating the actual variance of the account's key behavior data from the account's number of key behaviors in each time period of the preset length; and taking the ratio of the mean square error to the actual variance of each account's key behavior data as the goodness of fit of that account's key behavior data.
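As a worked form of these two sub-steps, assume the independent variable is the interval index $x_i = i$ and the dependent variable is the count $T_i$ of key behaviors in interval $i$, for $i = 0, \dots, n-1$. Ordinary least squares then gives the fitted line below. Reading the "mean square error" as the mean squared residual, the worked example later in the description computes the goodness of fit as one minus the ratio of that quantity to the variance. The symbols $\hat{a}$, $\hat{b}$, $\hat{y}_i$, and $n$ are notation introduced here for illustration.

```latex
\begin{aligned}
\hat{a} &= \frac{\sum_{i=0}^{n-1}(x_i-\bar{x})(T_i-\bar{T})}{\sum_{i=0}^{n-1}(x_i-\bar{x})^2},
\qquad
\hat{b} = \bar{T}-\hat{a}\,\bar{x},
\qquad
\hat{y}_i = \hat{a}\,x_i+\hat{b},\\[4pt]
R^2 &= 1-\frac{\tfrac{1}{n}\sum_{i=0}^{n-1}(T_i-\hat{y}_i)^2}{\tfrac{1}{n}\sum_{i=0}^{n-1}(T_i-\bar{T})^2}.
\end{aligned}
```

The factors of $1/n$ cancel, so the same value results from the ratio of the residual sum of squares to the total sum of squares.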
Preferably, step S103 specifically includes:
S1031: comparing, in batches, the goodness of fit of the key behavior data of each account in the user account set with a set goodness threshold;
S1032: when the goodness of fit of the key behavior data of an account is greater than or equal to the goodness threshold, determining that the account is a machine account; and when the goodness of fit of the key behavior data of an account is smaller than the goodness threshold, determining that the account is a non-machine account.
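The batch comparison in S1031 and S1032 amounts to a single pass over the per-account goodness values. A minimal sketch follows; the function name and the sample values are hypothetical, and the default threshold of 0.98 is borrowed from the worked example later in the description.

```python
GOODNESS_THRESHOLD = 0.98  # assumption: value taken from the later worked example


def classify_accounts(goodness_by_account, threshold=GOODNESS_THRESHOLD):
    """Label every account in one pass: True means machine account."""
    return {uid: r2 >= threshold for uid, r2 in goodness_by_account.items()}


# Hypothetical goodness-of-fit values for three accounts.
labels = classify_accounts({1: 0.9995, 2: 0.2, 3: 0.98})
```

The comparison uses "greater than or equal to", following S1032, so an account exactly at the threshold is labeled a machine account.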
As shown in FIG. 2, in combination with the embodiment of the present invention, there is also provided a system for batch identification of machine accounts, including a database and a computing engine Spark, where the computing engine Spark includes a key behavior data integration unit 21, a linear regression unit 22, and a judgment unit 23, wherein:
the database is used for storing the user behavior log of login accounts;
the key behavior data integration unit 21 is configured to periodically acquire from the database the user behavior log of login accounts for the previous period, extract all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and form a user account set; acquire, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period, and form an elastic data set for each account from the occurrence times of all of its key behaviors; and, for any account in the user account set, count in sequence, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; the key behaviors are behaviors that a user performs within the authority of the account and that reach a preset importance level;
the linear regression unit 22 is configured to fit, for any account in the user account set, the change over time of the account's number of key behaviors per time period with a linear regression equation to obtain a linear regression fitting curve of the account's key behavior data; and to calculate the goodness of fit of the key behavior data of each account from the linear regression fitting curve of each account in the user account set;
the judgment unit 23 is configured to judge in batches whether each account is a machine account according to the goodness of fit of the key behavior data of each account in the user account set; a machine account is an account that is logged in in batches by a script in order to perform illegal operations.
Preferably, the key behavior data integration unit 21 includes:
an intermediate elastic data set subunit 211, configured to acquire, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set in the previous period and, for each key behavior, form an intermediate elastic data set comprising the account in which the key behavior occurred and the occurrence time of the key behavior;
a key behavior data integration subunit 212, configured to obtain all intermediate elastic data sets of the same account through the groupByKey function of the computing engine Spark, form an array from the occurrence times of the key behaviors in those intermediate elastic data sets, and form the elastic data set of the account from the account and the array of occurrence times of all of its key behaviors.
Preferably, the key behavior data integration unit 21 further includes:
an intermediate elastic data set optimizing subunit 213, configured to, after the intermediate elastic data sets are formed, use the mapToPair function of the computing engine Spark on each intermediate elastic data set to subtract the start time of the current period from the occurrence time of the key behavior, giving the relative time at which each key behavior occurred, and to convert the unit of each relative time to give the converted time at which each key behavior occurred, thereby obtaining an optimized intermediate elastic data set;
the key behavior data integration subunit 212 is specifically configured to take the optimized intermediate elastic data sets as the objects gathered by the groupByKey function to form the elastic data set of each account.
Preferably, the linear regression unit 22 includes:
a linear fitting subunit 221, configured to, for any account, take the time period of the preset length as the independent variable of the linear regression equation and the account's number of key behaviors as the dependent variable, to obtain a linear regression fitting curve of the account's key behavior data;
a goodness-of-fit calculation subunit 222, configured to calculate the mean square error of the estimated values of the dependent variable on the linear regression fitting curve of the account's key behavior data, and to calculate the actual variance of the account's key behavior data from the account's number of key behaviors in each time period of the preset length; and to take the ratio of the mean square error to the actual variance of each account's key behavior data as the goodness of fit of that account's key behavior data. The time period of the preset length can be one unit of converted time.
Preferably, the judgment unit 23 includes:
a comparing subunit 231, configured to compare, in batches, the goodness of fit of the key behavior data of each account in the user account set with a set goodness threshold;
a determining subunit 232, configured to determine that an account is a machine account when the goodness of fit of its key behavior data is greater than or equal to the goodness threshold, and to determine that the account is a non-machine account when the goodness of fit of its key behavior data is smaller than the goodness threshold.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
The technical terms involved in the invention are explained as follows:
Machine account: on modern internet social media platforms, a large number of bad actors use scripts to log in to accounts in batches and perform illegal operations such as artificially inflating counts; such accounts are machine accounts. They generally have no substantial content, harm normal users' experience, and challenge the fairness of the platform.
Behavior log: the log recorded when an internet account performs an uplink operation such as a like, a comment, or a follow. It includes the operation behavior number, the account, the time, the target, and other information.
The invention relates to a system and method for batch identification of machine accounts based on Spark and linear regression, which uses data mining and analysis to find, automatically and in batches, machine accounts that are logged in in batches by scripts. Its advantages are that machine accounts with low-frequency access can also be found, that the discovery rate for such accounts is high, and that the false-positive (accidental injury) rate of the whole system is reduced.
The complete technical scheme of the system and method for batch identification of machine accounts based on Spark and linear regression is as follows:
1. Obtain the set U (i.e. the user account set) of all users whose number of behaviors (likes, comments, forwards) exceeded C on the previous day.
2. Use Spark's Hive query to retrieve the times at which every uid in U performed key behaviors yesterday, and form the timestamps of the key behaviors into an intermediate elastic data set RDD1 with the format [uid, t]. Here Spark is a computing engine deployed on a distributed cluster, and Hive is the database.
3. Use Spark's mapToPair function to subtract the timestamp of 00:00 yesterday from t, divide the result by t0 (3600 s is appropriate), and round it to an integer, forming the optimized intermediate elastic data set RDD2 with the format [uid, h]. Because the converted time unit is coarser than the unit of the relative time, the number of key behaviors of an account per unit of converted time is generally larger than the number per unit of relative time.
4. Use Spark's groupByKey function to gather the h values of the same uid, forming the elastic data set RDD3 of the account with the format [uid, [h0, h1, …]].
5. For any account in the user account set, count in sequence, from the elastic data set of the account, the number of key behaviors occurring in each time period of the preset length. That is, use Spark's collect function to take RDD3 out of Spark as an array L, and for each element of L count the total number of behaviors in every interval of length T0, obtaining the user's total T0 for the time from 0 to T0, the total T1 for T0 to 2T0, the total T2 for 2T0 to 3T0, and so on, forming a list T.
6. For any account in the user account set, fit the change over time of the account's number of key behaviors per time period with a linear regression equation to obtain a linear regression fitting curve of the account's key behavior data, and calculate the goodness of fit of the key behavior data of each account from the linear regression fitting curve of each account in the user account set.
Perform a goodness-of-fit test: if the sequence T0, T1, T2, … is almost constant with little change, the goodness of fit R² of the linear regression will be high.
7. Define a threshold R0; if R² > R0, the account is considered a machine account.
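The counting in step 5 can be sketched as below. This assumes the values collected for one account are already expressed in units of the interval length (as produced by step 3), so that interval k simply collects the values equal to k; the function name and the sample data are hypothetical.

```python
from collections import Counter


def counts_per_interval(converted_times, n_intervals):
    """Build T: T[k] is the number of key behaviors in interval k."""
    tally = Counter(converted_times)
    return [tally.get(k, 0) for k in range(n_intervals)]


# Hypothetical collected element of L: (uid, [h0, h1, ...]).
uid, hours = 7, [0, 0, 3, 3, 3, 5]
T = counts_per_interval(hours, n_intervals=6)
```

Intervals with no key behavior are kept as explicit zeros, so T always has one entry per interval and the regression sees the quiet periods as well as the busy ones.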
Specific examples are as follows:
For all users whose number of behaviors exceeded 1000 on the previous day, the key behavior records are queried in Hive, for example [1:20201010080810, 1:20201010080910, …], indicating that user 1 initiated key behaviors at 20201010080810 and 20201010080910.
Steps 2 and 3 then convert each timestamp to the hour in which the behavior occurred, i.e. [1:8, 1:8, …];
then, in step 4, the records are aggregated by uid to obtain data of the form [uid: list of hours in which the key behaviors occurred], namely [1: [8, 8, 9, 9, 10, 10, 11, 11], 2: [9, 10, 18, 18, 18, …], …].
For one of the users, assume the behavior list is [0, 0, 1, 1, 2, 2, 3, 3, …, 23]; counting the number of behaviors in every interval of length T0 then gives T (if T0 is one hour, the length of T is 24):
[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1]
It can be seen that this account behaves uniformly at all times of day, much like a machine account, because the line graph of T for a machine account is very smooth and resembles a straight line. Normal users, by contrast, generally do not visit at night and rest for part of the day, so their behavior, and with it T, fluctuates significantly. In other words, if linear regression is used for fitting, the fit for a machine account will be very good; the better a linear regression fits the T sequence, the more likely the account is abnormal.
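The timestamp-to-hour conversion used in this example can be sketched with the standard library. The format string is an assumption about the layout of values such as 20201010080810 (year, month, day, hour, minute, second), and `hour_of` is a hypothetical helper name.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed layout of values like 20201010080810


def hour_of(timestamp):
    """Hour of day (0-23) for a timestamp such as 20201010080810."""
    return datetime.strptime(str(timestamp), TIMESTAMP_FORMAT).hour


# The two sample records from the example: (uid, timestamp).
records = [(1, 20201010080810), (1, 20201010080910)]
hours = [(uid, hour_of(ts)) for uid, ts in records]
```

Both sample records fall in hour 8, which matches the [1:8, 1:8, …] form shown for steps 2 and 3.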
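Below is the calculation of R², where ŷ_i is the value of the fitted line at x_i and T̄ is the mean of T:
\[ R^2 = 1 - \frac{\sum_i (T_i - \hat{y}_i)^2}{\sum_i (T_i - \bar{T})^2} \]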
Many software packages can perform this optimization fit; here the curve_fit method from Python's scipy package is used.
f(x) is defined as the straight line y = ax + b, then:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f, x, T)
x = [0, 1, 2, 3, …] is defined so that its length matches the length of T.
After this statement executes, popt holds the optimized parameters a and b.
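Calculation of the goodness of fit R²:
import numpy
yvals = f(x, *popt)  # fitted values on the regression line; x must be a NumPy array
sum0 = 0  # residual sum of squares
sum1 = 0  # total sum of squares
average = numpy.average(T)
for i in range(len(yvals)):
    sum0 += (T[i] - yvals[i]) ** 2
    sum1 += (T[i] - average) ** 2
R2 = 1 - (sum0 / sum1)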
For this user's T the result is R² ≈ 0.9995; with R0 = 0.98, R² > R0, so the user is judged to be a machine user.
Looking next at T for a normal user:
[1,0,0,0,0,0,0,0,1,0,2,1,10,10,0,0,0,4,0,19,20,40,40,20];
R² is about 0.2, so R² < R0;
the user is determined to be a normal user.
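The fit and the R² calculation can also be written without SciPy. The sketch below swaps curve_fit for the closed-form ordinary least-squares solution for a straight line, which yields the same line for this model; the function names and the `straight` sample are hypothetical. It does not attempt to reproduce the R² figures quoted above, which depend on how T was built, and it raises an error for a perfectly constant T because the total sum of squares is then zero and R² is undefined.

```python
def fit_line(T):
    """Ordinary least-squares line through the points (i, T[i])."""
    n = len(T)
    x_mean = (n - 1) / 2
    t_mean = sum(T) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (t - t_mean) for i, t in enumerate(T))
    a = sxy / sxx
    return a, t_mean - a * x_mean


def r_squared(T):
    """R^2 = 1 - residual sum of squares / total sum of squares."""
    a, b = fit_line(T)
    t_mean = sum(T) / len(T)
    residual = sum((t - (a * i + b)) ** 2 for i, t in enumerate(T))
    total = sum((t - t_mean) ** 2 for t in T)
    if total == 0:
        raise ValueError("R^2 is undefined for a constant sequence")
    return 1 - residual / total


R0 = 0.98  # threshold used in the example above
straight = [3, 5, 7, 9, 11]  # hypothetical, exactly linear sequence
normal_user = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 1,
               10, 10, 0, 0, 0, 4, 0, 19, 20, 40, 40, 20]
is_machine = {
    "straight": r_squared(straight) > R0,
    "normal_user": r_squared(normal_user) > R0,
}
```

An exactly linear sequence gives R² equal to 1 up to rounding, while the normal user's sequence from the example stays below the threshold.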
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media can include, but is not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and, thus, is included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., infrared, radio, and microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for batch identification of machine accounts, comprising:
periodically acquiring, by a computing engine Spark, the user behavior log of logged-in accounts for the previous period from a database, extracting all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and forming a user account set;
acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set during the previous period, and forming an elastic data set (that is, a Spark resilient distributed dataset, RDD) for each account from the occurrence times of all key behaviors of that account;
for any account in the user account set, counting in sequence, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; wherein a key behavior is a behavior that a user performs within the permission scope of the account and that reaches a preset importance level;
for any account in the user account set, fitting the relationship between the number of key behaviors of the account in each time period and time by using a linear regression equation, to obtain a linear regression fitting curve of the key behavior data of the account; and calculating, from the linear regression fitting curve of each account in the user account set, the goodness of fit of the key behavior data corresponding to each account;
judging in batch, according to the goodness of fit of the key behavior data of each account in the user account set, whether each account is a machine account; wherein a machine account is an account registered in batch by a script in order to perform illegal operations.
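As an illustration of the first and third steps of claim 1, the following is a minimal plain-Python sketch, not the Spark implementation the claim describes. The log layout (tuples of account, behavior name and integer timestamp), the function names `active_accounts` and `bucket_counts`, and the choice of integer seconds are all assumptions made for the example.

```python
from collections import Counter

def active_accounts(log, min_behaviors):
    """Return the accounts whose number of log entries exceeds min_behaviors.

    `log` is assumed to be an iterable of (account, behavior, timestamp) tuples.
    """
    counts = Counter(account for account, _behavior, _ts in log)
    return {account for account, n in counts.items() if n > min_behaviors}

def bucket_counts(times, period_start, period_end, bucket_len):
    """Count events in consecutive buckets of bucket_len covering [period_start, period_end).

    All arguments are assumed to be integers in the same unit (e.g. seconds).
    """
    n_buckets = -(-(period_end - period_start) // bucket_len)  # ceiling division
    counts = [0] * n_buckets
    for t in times:
        if period_start <= t < period_end:
            counts[(t - period_start) // bucket_len] += 1
    return counts
```

Empty buckets are kept as zeros, so that the later regression sees one data point per time period rather than only the periods in which something happened.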
2. The method for batch identification of machine accounts according to claim 1, wherein the step of obtaining occurrence times of all key behaviors of each account in a previous cycle in a user behavior log of a database and forming the occurrence times of all key behaviors of each account into an elastic data set of each account specifically comprises:
the computing engine Spark acquires, from the user behavior log of the database, the occurrence time of every key behavior of each account in the user account set during the previous period, and forms, for each key behavior, an intermediate elastic data set comprising the account on which the key behavior occurred and the occurrence time of that key behavior;
all intermediate elastic data sets of the same account are obtained through the groupByKey function of the computing engine Spark, the occurrence times of all key behaviors in the intermediate elastic data sets of that account are formed into an array, and the account together with the array of occurrence times of all its key behaviors forms the elastic data set of the account.
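The grouping step of claim 2 can be pictured with a small plain-Python stand-in for Spark's groupByKey. This is only a sketch of the data shape: the function name `group_by_key` is hypothetical, the input is assumed to be (account, occurrence time) pairs, and sorting the times is an added convenience that the claim does not require.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect (account, time) pairs into {account: sorted list of times}."""
    grouped = defaultdict(list)
    for account, occurred_at in pairs:
        grouped[account].append(occurred_at)
    # Sorting is an assumption added for readability of the per-account array.
    return {account: sorted(times) for account, times in grouped.items()}
```

On a real cluster the same shape would presumably come from calling groupByKey on a pair RDD, with Spark handling the shuffle between partitions.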
3. The method for batch identification of machine accounts according to claim 2, further comprising:
after the intermediate elastic data sets are formed, for each intermediate elastic data set, subtracting the start time of the current period from the occurrence time of the key behavior by using the mapToPair function of the computing engine Spark to obtain the relative occurrence time of each key behavior, and converting the unit of each relative time to obtain the converted occurrence time of each key behavior, thereby obtaining an optimized intermediate elastic data set; the optimized intermediate elastic data set is then used as the input of the groupByKey function to form the elastic data set of each account.
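The time normalization of claim 3 amounts to a per-record map. The sketch below is a plain-Python stand-in for the mapToPair step; the claim does not state the units involved, so millisecond timestamps converted to whole minutes are an assumption, as is the function name `to_relative_minutes`.

```python
MS_PER_MINUTE = 60_000

def to_relative_minutes(pairs, period_start_ms):
    """Map (account, absolute ms timestamp) to (account, minutes since period start)."""
    return [
        (account, (occurred_at_ms - period_start_ms) // MS_PER_MINUTE)
        for account, occurred_at_ms in pairs
    ]
```

Working with small relative values instead of raw epoch timestamps keeps the later bucketing and regression arithmetic simple.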
4. The method for batch identification of machine accounts according to claim 2, wherein the fitting of the time-dependent variation relationship of the number of the key behaviors of the account in each length time period by using a linear regression equation for any account in the user account set to obtain a linear regression fitting curve of the key behavior data of the account specifically comprises:
for any account, taking the time periods of preset length as the independent variable of the linear regression equation and the number of key behaviors of the account as the dependent variable of the linear regression equation, to obtain the linear regression fitting curve of the key behavior data of the account;
the calculating of the goodness of fit of the key behavior data corresponding to each account according to the linear regression fitting curve of each account in the user account set specifically includes:
calculating the mean square error of the estimated values of the dependent variable in the linear regression fitting curve of the key behavior data of the account, and calculating the actual variance of the key behavior data of the account from the number of key behaviors of the account in each time period of preset length; and taking the ratio of the mean square error to the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of that account.
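One way to read the goodness-of-fit definition in claim 4 is as the usual coefficient of determination: the squared deviation of the fitted values from the mean of the observed counts, divided by the squared deviation of the observed counts from that mean. The sketch below follows that reading, which is an interpretation rather than something the claim spells out. The function names, the ordinary least squares fit, and the value returned for a perfectly constant series are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Assumes at least two points with distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def goodness_of_fit(xs, ys):
    """Explained variation of the fitted line divided by total variation of ys."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    explained = sum((slope * x + intercept - mean_y) ** 2 for x in xs)
    total = sum((y - mean_y) ** 2 for y in ys)
    if total == 0:
        # Assumption: a constant count per period is fitted exactly, so report 1.0.
        return 1.0
    return explained / total
```

Here `xs` would be the indices of the time periods and `ys` the key behavior counts in those periods. A value near 1 means the counts follow a straight line closely, which under this reading is the regular pattern that the thresholding in claim 5 treats as a machine account.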
5. The method for batch identification of machine accounts according to claim 4, wherein the batch judgment of whether each account is a machine account according to the goodness of fit of the key behavior data of each account specifically comprises:
comparing the goodness of fit of key behavior data of each account in the user account set with a set goodness threshold in batch;
when the goodness of fit of the key behavior data of an account is greater than or equal to the goodness threshold, determining that the account is a machine account;
and when the goodness of fit of the key behavior data of an account is smaller than the goodness threshold, determining that the account is a non-machine account.
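The batch comparison of claim 5 reduces to one threshold test per account. A minimal sketch follows, assuming the goodness-of-fit values have already been computed and collected into a dictionary; the function name `classify_accounts` and the threshold used in the usage note are hypothetical, since the source does not give a threshold value.

```python
def classify_accounts(goodness_by_account, threshold):
    """Return {account: True if machine account, False otherwise}.

    An account is flagged when its goodness of fit is >= threshold.
    """
    return {
        account: goodness >= threshold
        for account, goodness in goodness_by_account.items()
    }
```

For example, with an illustrative threshold of 0.9, an account scoring exactly 0.9 is flagged, because the comparison in the claim is "greater than or equal to".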
6. A system for batch identification of machine accounts, comprising a database and a computing engine Spark, wherein the computing engine Spark comprises a key behavior data integration unit, a linear regression unit and a judgment unit, wherein:
the database is used for storing a user behavior log of the login account;
the key behavior data integration unit is used for periodically acquiring the user behavior log of logged-in accounts for the previous period from the database, extracting all accounts whose number of behaviors in the previous period exceeds a preset quantity threshold, and forming a user account set; acquiring, from the user behavior log of the database, the occurrence times of all key behaviors of each account in the user account set during the previous period, and forming an elastic data set for each account from the occurrence times of all key behaviors of that account; and, for any account in the user account set, counting in sequence, from the elastic data set of the account, the number of key behaviors occurring in each time period of a preset length; wherein a key behavior is a behavior that a user performs within the permission scope of the account and that reaches a preset importance level;
the linear regression unit is used for fitting, for any account in the user account set, the relationship between the number of key behaviors of the account in each time period and time by using a linear regression equation, to obtain a linear regression fitting curve of the key behavior data of the account; and calculating, from the linear regression fitting curve of each account in the user account set, the goodness of fit of the key behavior data corresponding to each account;
the judgment unit is used for judging in batch, according to the goodness of fit of the key behavior data of each account in the user account set, whether each account is a machine account; wherein a machine account is an account registered in batch by a script in order to perform illegal operations.
7. The system for batch identification of machine accounts of claim 6, wherein the key behavior data integration unit comprises:
the intermediate elastic data set subunit is used for acquiring, from the user behavior log of the database, the occurrence time of every key behavior of each account in the user account set during the previous period, and forming, for each key behavior, an intermediate elastic data set comprising the account on which the key behavior occurred and the occurrence time of that key behavior;
and the key behavior data integration subunit is used for obtaining all intermediate elastic data sets of the same account through the groupByKey function of the computing engine Spark, forming the occurrence times of all key behaviors in the intermediate elastic data sets of that account into an array, and forming the elastic data set of the account from the account together with the array of occurrence times of all its key behaviors.
8. The system for batch identification of machine accounts according to claim 7, wherein the key behavior data integration unit further comprises:
the intermediate elastic data set optimizing subunit is used for, after the intermediate elastic data sets are formed, subtracting the start time of the current period from the occurrence time of the key behavior through the mapToPair function of the computing engine Spark to obtain the relative occurrence time of each key behavior, and converting the unit of each relative time to obtain the converted occurrence time of each key behavior, thereby obtaining the optimized intermediate elastic data sets;
the key behavior data integration subunit is specifically configured to use the optimized intermediate elastic data sets as the input of the groupByKey function to form the elastic data set of each account.
9. The system for batch identification of machine accounts of claim 7, wherein the linear regression unit comprises:
the linear fitting subunit is used for taking, for any account, the time periods of preset length as the independent variable of the linear regression equation and the number of key behaviors of the account as the dependent variable of the linear regression equation, to obtain the linear regression fitting curve of the key behavior data of the account;
a goodness-of-fit calculation subunit, configured to calculate the mean square error of the estimated values of the dependent variable in the linear regression fitting curve of the key behavior data of the account, and to calculate the actual variance of the key behavior data of the account from the number of key behaviors of the account in each time period of preset length; and to take the ratio of the mean square error to the actual variance of the key behavior data of each account as the goodness of fit of the key behavior data of that account.
10. The system for batch identification of machine accounts according to claim 9, wherein the judgment unit includes:
the comparison subunit is used for comparing, in batch, the goodness of fit of the key behavior data of each account in the user account set with a set goodness threshold;
the judging subunit is used for determining that an account is a machine account when the goodness of fit of its key behavior data is greater than or equal to the goodness threshold, and determining that the account is a non-machine account when the goodness of fit of its key behavior data is smaller than the goodness threshold.
CN202110083543.4A 2021-01-21 2021-01-21 Method and system for identifying machine accounts in batches Pending CN112861128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110083543.4A CN112861128A (en) 2021-01-21 2021-01-21 Method and system for identifying machine accounts in batches

Publications (1)

Publication Number Publication Date
CN112861128A true CN112861128A (en) 2021-05-28

Family

ID=76008938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110083543.4A Pending CN112861128A (en) 2021-01-21 2021-01-21 Method and system for identifying machine accounts in batches

Country Status (1)

Country Link
CN (1) CN112861128A (en)

Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571484A (en) * 2011-12-14 2012-07-11 上海交通大学 Method for detecting and finding online water army
CN103839197A (en) * 2014-03-19 2014-06-04 国家电网公司 Method for judging abnormal electricity consumption behaviors of users based on EEMD method
JP2014160344A (en) * 2013-02-19 2014-09-04 Nippon Telegr & Teleph Corp <Ntt> Bot determination device and method and program and numerical value aggregate distribution determination device
JP2015141456A (en) * 2014-01-27 2015-08-03 Kddi株式会社 bot determination device, bot determination method, and program
CN106886915A (en) * 2017-01-17 2017-06-23 华南理工大学 A kind of ad click predictor method based on time decay sampling
CN107305611A (en) * 2016-04-22 2017-10-31 腾讯科技(深圳)有限公司 The corresponding method for establishing model of malice account and device, the method and apparatus of malice account identification
CN108109015A (en) * 2017-12-29 2018-06-01 广州品唯软件有限公司 A kind of marketing selective analysis method and device
US20180234447A1 (en) * 2015-08-07 2018-08-16 Stc.Unm System and methods for detecting bots real-time
CN109359848A (en) * 2018-10-09 2019-02-19 烟台海颐软件股份有限公司 A kind of extremely relevant electricity consumer recognition methods of line loss and system
JP2019054715A (en) * 2017-09-15 2019-04-04 東京電力ホールディングス株式会社 Power theft monitoring system, power theft monitoring device, power theft monitoring method and program
CN109818921A (en) * 2018-12-14 2019-05-28 微梦创科网络科技(中国)有限公司 A kind of analysis method and device of the improper flow of website interface
CN110288114A (en) * 2019-03-22 2019-09-27 国网浙江省电力有限公司信息通信分公司 Violation electricity consumption behavior prediction method based on power marketing data
US20200084219A1 (en) * 2018-09-06 2020-03-12 International Business Machines Corporation Suspicious activity detection in computer networks
CN110988422A (en) * 2019-12-19 2020-04-10 北京中电普华信息技术有限公司 Electricity stealing identification method and device and electronic equipment
CN111159399A (en) * 2019-12-13 2020-05-15 天津大学 Automobile vertical website water army discrimination method
CN111275416A (en) * 2020-01-15 2020-06-12 中国人民解放军国防科技大学 Digital currency abnormal transaction detection method and device, electronic equipment and medium
CN111368254A (en) * 2020-03-02 2020-07-03 西安邮电大学 Multi-view data missing completion method for multi-manifold regularization non-negative matrix factorization
CN111507611A (en) * 2020-04-15 2020-08-07 北京中电普华信息技术有限公司 Method and system for determining electricity stealing suspected user
CN111507377A (en) * 2020-03-24 2020-08-07 微梦创科网络科技(中国)有限公司 Number maintenance account number batch identification method and device
CN111984695A (en) * 2020-07-21 2020-11-24 微梦创科网络科技(中国)有限公司 Method and system for determining black grouping based on Spark
CN112000711A (en) * 2020-07-21 2020-11-27 微梦创科网络科技(中国)有限公司 Method and system for determining evaluation user based on Spark
CN112084229A (en) * 2020-07-27 2020-12-15 北京市燃气集团有限责任公司 Method and device for identifying abnormal gas consumption behaviors of town gas users
CN112115324A (en) * 2020-08-10 2020-12-22 微梦创科网络科技(中国)有限公司 Method and device for confirming praise-refreshing user based on power law distribution
CN112149036A (en) * 2020-09-28 2020-12-29 微梦创科网络科技(中国)有限公司 Method and system for identifying batch abnormal interaction behaviors
CN112148947A (en) * 2020-09-28 2020-12-29 微梦创科网络科技(中国)有限公司 Method and system for mining and reviewing users in batches

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
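Liu Kan et al.: "Research on identifying microblog machine users based on random forest classification", Acta Scientiarum Naturalium Universitatis Pekinensis, pages 289 - 300 *
Zhang Yanmei et al.: "Research on a Bayesian-model-based algorithm for identifying microblog internet water armies", Journal on Communications, pages 44 - 53 *
Lin Yongcheng: "Research and application of machine-user screening technology for social networks", China Master's Theses Full-text Database (Information Science and Technology), pages 139 - 144 *
Wang Junbo: "Identifying internet water armies based on e-commerce reviews", China Master's Theses Full-text Database (Information Science and Technology), pages 138 - 823 *
Xie Zhonghong et al.: "Microblog water army identification based on the logistic regression algorithm", Microcomputer & Its Applications, pages 67 - 69 *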

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination