WO2021012509A1 - Method, device, and computer storage medium for detecting abnormal account - Google Patents

Method, device, and computer storage medium for detecting abnormal account

Info

Publication number
WO2021012509A1
Authority
WO
WIPO (PCT)
Prior art keywords
access data
target
data
account
access
Prior art date
Application number
PCT/CN2019/117581
Other languages
French (fr)
Chinese (zh)
Inventor
侯明远
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021012509A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Definitions

  • This application relates to the field of Internet technology, and in particular to an abnormal account detection method, device and computer storage medium.
  • The first method is to manually set a warning line (a "cordon"). If the number of times an account accesses the company's key uniform resource locator (URL) exceeds this warning line, the account's permissions are revoked or the account is added to a blacklist.
  • The limitation of this method is that it is difficult to set a "correct" warning line, so some violating accounts are never found.
  • The second method uses supervised machine learning: samples are labeled "violation" or "non-violation", and a classifier is generated from the labeled samples.
  • the embodiments of the present application provide an abnormal account detection method, device, and computer storage medium, which can accurately and efficiently detect abnormally accessed accounts.
  • The embodiment of the present application also provides an abnormal account detection method, including:
  • obtaining N sets of access data for N accounts accessing the target website within a preset time, where each of the N sets of access data is M-dimensional data, and N and M are both positive integers;
  • using a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1];
  • determining the account corresponding to abnormally distributed access data in the [0,1] interval to be an abnormal account, where abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
  • the embodiment of the present application also provides an abnormal account detection device, including:
  • an obtaining unit, configured to obtain N sets of access data for N accounts accessing the target website within a preset time, where each of the N sets of access data is M-dimensional data, and N and M are both positive integers;
  • a mapping unit, configured to map the data of the N sets of access data in each dimension to the interval [0,1] by using a Gaussian kernel function;
  • a determining unit, configured to determine the account corresponding to abnormally distributed access data in the [0,1] interval to be an abnormal account, where abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
  • the embodiment of the present application also provides a computer device for executing the abnormal account detection method provided in the first aspect.
  • the computer device may include a processor, a communication interface and a memory, and the processor, the communication interface and the memory are connected to each other.
  • the communication interface is used to communicate with other network devices (such as terminals), the memory is used to store the implementation code of the above abnormal account detection method, and the processor is used to execute the program code stored in the memory, that is, the above abnormal account detection method is executed.
  • The embodiment of the present application also provides a computer non-volatile readable storage medium. The non-volatile readable storage medium stores instructions which, when run on a processor, cause the processor to execute the above abnormal account detection method.
  • the embodiment of the present application also provides a computer program product containing instructions, which when running on a processor, causes the processor to execute the above abnormal account detection method.
  • The Gaussian kernel algorithm is used to identify abnormal accounts automatically and in an unsupervised way. For a given kind of access data, statistics can be layered along different time dimensions (for example, statistics for one minute).
  • A feature is divided into multiple dimensions according to the time dimension, and then high-dimensional space mapping is performed to find abnormal accounts more accurately.
  • FIG. 1 is a schematic diagram of the hardware structure of a computer device provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of an abnormal account detection method provided by an embodiment of the present application.
  • FIG. 3A is a schematic diagram of a one-dimensional space provided by an embodiment of the present application.
  • FIG. 3B is a schematic diagram of a multi-dimensional space provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the logical structure of a computer device provided by an embodiment of the present application.
  • FIG. 1 shows a computer device provided by an embodiment of the present application.
  • the computer device 100 may include a memory 101, a communication interface 102, and one or more processors 103. These components can be connected through the bus 104 or in other ways.
  • FIG. 1 uses a bus connection as an example, in which:
  • the memory 101 may be coupled with the processor 103 through a bus 104 or an input/output port, and the memory 101 may also be integrated with the processor 103.
  • the memory 101 is used to store various software programs and/or multiple sets of instructions.
  • the memory 101 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
  • the memory 101 may also store a network communication program, which may be used to communicate with one or more additional devices, one or more terminals, and one or more network devices.
  • The processor 103 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the processor 103 can process data received through the communication interface 102.
  • The communication interface 102 is used by the computer device 100 to communicate with other network devices, such as a terminal.
  • the communication interface 102 may be a transceiver, a transceiver circuit, etc., where the communication interface is a general term and may include one or more interfaces, such as an interface between a terminal and a server.
  • the communication interface 102 may include a wired interface and a wireless interface, such as a standard interface, Ethernet, and a multi-machine synchronization interface.
  • The processor 103 can be used to read and execute computer-readable instructions. Specifically, the processor 103 may be used to call data stored in the memory 101. Optionally, when the processor 103 sends any message or data, it does so by driving or controlling the communication interface 102 to perform the sending. Optionally, when the processor 103 receives any message or data, it does so by driving or controlling the communication interface 102 to perform the receiving. The processor 103 can therefore be regarded as the control center for sending and receiving, and the communication interface 102 as the component that actually performs the sending and receiving operations.
  • the communication interface 102 is specifically configured to perform the data transceiving steps involved in the following method embodiments, and the processor 103 is specifically configured to implement data processing steps other than data transceiving.
  • the computer device 100 may be a server or terminal device with computing or processing capabilities.
  • FIG. 2 provides an abnormal account detection method related to an embodiment of the present application.
  • the abnormal account detection method includes but is not limited to the following steps S201-S203.
  • Step S201 Obtain N sets of access data for N accounts to access the target website within a preset time, each of the N sets of access data is M-dimensional data, and N and M are both positive integers.
  • The access data includes one or more of the access time or the number of accesses.
  • The type of access to the target website may include, but is not limited to, one or more of login, inquiry, retrieval, or insurance tracking.
  • The target website can be, for example, the insurance system website (or URL) that is logged in to, the insurance system website (or URL) corresponding to inquiry, the insurance system website (or URL) corresponding to retrieval, or the insurance system website (or URL) corresponding to insurance tracking.
  • If the access type is login, the access data may be the access time and/or the number of accesses to a certain website (or URL).
  • If the access type is inquiry, the access data may be the access time and/or the number of accesses to a certain inquiry website (or URL) for inquiry.
  • If the access type is retrieval, the access data may be the access time and/or the number of accesses to a certain retrieval website (or URL) for retrieval. If the access type is insurance tracking, the access data may be the access time and/or the number of accesses to a certain insurance tracking website (or URL) for insurance tracking.
  • the aforementioned N accounts may be all accounts that access the target website.
  • The preset time can be one second, one minute, one hour, one day, one week, and so on. A preset time for the working period and a preset time for the non-working period can also be set separately.
  • For example, the working time period includes 7:00-23:59, and the non-working time period includes 00:00-6:59.
  • The preset time corresponding to the working time period and the preset time corresponding to the non-working time period may be different. For example, the preset time corresponding to the working time period may be 1 second, and the preset time corresponding to the non-working time period may be 1 minute.
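  • As an illustration only, the choice of statistics window by time of day could be sketched as below. The period boundaries and the 1 second / 1 minute values come from the example in the text; the function names `is_working_time` and `preset_window` are hypothetical and not part of the application.

```python
from datetime import time, timedelta

# Boundary from the example in the text: 07:00-23:59 is the working period,
# 00:00-06:59 is the non-working period.
WORK_START = time(7, 0)


def is_working_time(t: time) -> bool:
    """Return True if t falls in the working period (07:00-23:59)."""
    return t >= WORK_START


def preset_window(t: time) -> timedelta:
    """Pick the statistics window (preset time) for a time of day.

    1 second for the working period and 1 minute for the non-working
    period are the example values from the text, not fixed requirements.
    """
    return timedelta(seconds=1) if is_working_time(t) else timedelta(minutes=1)
```

  • Keeping the boundary and the two window lengths as plain values makes it easy to substitute other preset times, which the text explicitly allows.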
  • There can be one preset time or several.
  • The statistical time periods corresponding to the same preset time can be the same or different for different accounts.
  • For a given access type, the access data of the N accounts within the preset time is counted. For example, if the access type is logging in to the target URL (such as the website of a certain insurance system), then the statistics (such as the number of visits) of the N accounts logging in to the target URL within a preset time (such as one minute) are collected. The minute used can differ from account to account: the minute in which an account has the most accesses in a day can be taken as that account's statistical time. For example, suppose account 1 visits the target URL most often during the minute 10:00-10:01 of a day; then the number of times account 1 visits the target URL during 10:00-10:01 is used as account 1's statistical data.
  • Similarly, if account 2 visits the target URL most often during the minute 11:00-11:01, the number of times account 2 accesses the target URL during 11:00-11:01 is taken as account 2's statistical data.
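  • The per-account "busiest minute" statistic described above can be sketched as follows. This is a minimal sketch under stated assumptions: access records are available as `(account, timestamp)` pairs, and minutes are taken as calendar minutes rather than sliding 60-second windows. The name `peak_minute_counts` is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def peak_minute_counts(records):
    """Return {account: number of accesses in that account's busiest minute}.

    records: iterable of (account_id, datetime) pairs (assumed layout).
    Each timestamp is truncated to its calendar minute, accesses are counted
    per (account, minute), and the largest count per account is kept.
    """
    per_minute = defaultdict(Counter)
    for account, ts in records:
        minute = ts.replace(second=0, microsecond=0)
        per_minute[account][minute] += 1
    return {account: max(c.values()) for account, c in per_minute.items()}


# Example: account1 peaks in 10:00-10:01, account2 in 11:00-11:01.
records = [
    ("account1", datetime(2019, 1, 1, 10, 0, 5)),
    ("account1", datetime(2019, 1, 1, 10, 0, 30)),
    ("account1", datetime(2019, 1, 1, 10, 0, 59)),
    ("account1", datetime(2019, 1, 1, 11, 0, 0)),
    ("account2", datetime(2019, 1, 1, 11, 0, 10)),
    ("account2", datetime(2019, 1, 1, 11, 0, 20)),
]
peaks = peak_minute_counts(records)
```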
  • The same approach can be used to obtain the access data of each account within a preset time for other access types. For example, count the number of inquiries made by the N accounts in one minute, the number of retrievals made by the N accounts in one minute, and so on.
  • Several different preset times may also be used. For example, you can count the number of times the N accounts log in to the target URL in one second, in one minute, in one hour, and in one day.
  • The preset time can also differ between access types: for example, one second for one access type, one hour for another, and one day for a third.
  • the access data may be data stored in the system database, or statistics based on the received access request.
  • A user uses an account to access the relevant URL link, and the access request issued by the account carries the corresponding access address.
  • When the system receives an access request from an account, it records the access time, the access address, and the identifier of the account.
  • Step S202 Use a Gaussian kernel function to map the data of the N groups of access data in each dimension to the interval [0,1].
  • The Gaussian kernel is a non-linear mathematical method for measuring similarity.
  • After the data in a characteristic dimension is mapped with the Gaussian kernel, anomalous points in that dimension can be identified.
  • The characteristic dimensions include: access time, number of visits, ratio of inquiry orders, retrieval orders, or insurance tracking orders, etc.
  • A Gaussian kernel is used to classify the N accounts. Since the access data of the N accounts is not linearly separable, a Gaussian kernel is needed to map the data set to a high-dimensional space, where the data becomes linearly separable.
  • The access data can be one-dimensional or multi-dimensional.
  • When M is equal to 1, that is, when the access data is one-dimensional, using a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1] includes:
  • using a hinge function to perform a preliminary transformation on the N sets of access data: if the target access data in the N sets of access data is greater than the mode of the N access data, the value of the target access data is set to the difference between the target access data and the mode;
  • if the target access data is less than or equal to the mode of the N access data, the value of the target access data is set to 0;
  • using the Gaussian kernel function to map the transformed access data to a one-dimensional space in which all values lie in the interval [0,1].
  • Determining the account corresponding to the abnormally distributed access data in the interval [0,1] as the abnormal account includes:
  • determining as an abnormal account the account whose access data lies at a distance from the value 0 in the one-dimensional space that is less than or equal to the first preset distance.
  • The mode is the value that appears most frequently among the N access data.
  • The first preset distance may be, for example, one-half or one-third.
  • The first preset distance can be set manually or set by default.
  • The Gaussian kernel transformation involves a center value and a scale parameter. The mode of the hinge-processed values is taken as the center value, and the difference between the maximum and minimum of the processed values, multiplied by 1/2 (or 1/3), is taken as the scale parameter. Once these two parameters are obtained, the Gaussian kernel transformation can be applied. The result is a one-dimensional space map in which all values are distributed between 0 and 1.
  • An abnormal point can be judged by the distance of the target point from 0.
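  • The one-dimensional procedure above (hinge about the mode, Gaussian kernel with the stated center and scale, then a distance-from-0 test) can be sketched as follows. The text does not give the kernel formula, so the common form exp(-(x - c)^2 / (2 s^2)) is assumed here; breaking ties between several modes by taking the smallest, and mapping everything to 1.0 when the scale is zero, are also assumptions. All function names are hypothetical.

```python
import math
from statistics import multimode


def hinge_about_mode(values):
    """Set values at or below the mode to 0, others to (value - mode)."""
    mode = min(multimode(values))  # assumption: smallest mode wins a tie
    return [v - mode if v > mode else 0 for v in values]


def gaussian_kernel_map(values, scale_fraction=0.5):
    """Map raw counts to [0, 1]; values near the center land near 1.

    Center = mode of the hinge-processed values; scale = (max - min) of the
    processed values times scale_fraction (1/2 or 1/3 in the text).
    Kernel form exp(-(x - c)^2 / (2 s^2)) is an assumption.
    """
    hinged = hinge_about_mode(values)
    center = min(multimode(hinged))
    scale = (max(hinged) - min(hinged)) * scale_fraction
    if scale == 0:  # all values identical: nothing stands out
        return [1.0] * len(hinged)
    return [math.exp(-((h - center) ** 2) / (2 * scale ** 2)) for h in hinged]


def abnormal_indices(mapped, first_preset_distance=0.5):
    """Indices whose mapped value lies within the preset distance of 0."""
    return [i for i, m in enumerate(mapped) if m <= first_preset_distance]


# Five accounts with ordinary login counts and one with a very high count.
mapped = gaussian_kernel_map([5, 5, 5, 6, 5, 100])
flagged = abnormal_indices(mapped)
```

  • With this kernel form, a zero center and a scale fraction of 1/2, the largest processed value always maps to exp(-2), so the choice of scale fraction and preset distance together controls how many high-count accounts are flagged.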
  • When M is greater than or equal to 2, that is, when the access data is multi-dimensional, using a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1] includes:
  • using a hinge function to perform a preliminary transformation of the N sets of access data in each dimension: if the value of the target access data in the target dimension is greater than the mode of the N access data in the target dimension, the value of the target access data in the target dimension is set to the difference between that value and the mode;
  • if the value of the target access data in the target dimension is less than or equal to the mode of the N access data in the target dimension, the value of the target access data in the target dimension is set to 0;
  • using the Gaussian kernel function to map the transformed access data to a multi-dimensional space. The multi-dimensional space is a hypercube composed of multiple [0,1] intervals; that is, the N sets of access data are all distributed between the points [0,...,0] and [1,...,1], both of which are M-dimensional.
  • Determining the account corresponding to the abnormally distributed access data in the interval [0,1] as the abnormal account includes:
  • determining as an abnormal account the account whose access data lies at a Euclidean distance from the space base point in the multi-dimensional space that is greater than or equal to the second preset distance, where the space base point is the point [1,...,1], which is M-dimensional.
  • The second preset distance may be, for example, one-half or one-third.
  • The second preset distance can be set manually or set by default by the device.
  • An abnormal point can be judged by the Euclidean distance between the target point and the space base point (i.e. [1,...,1]).
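  • A minimal sketch of the multi-dimensional case, assuming the access data is given as N rows of M counts. Each dimension is mapped independently and rows are then compared with the base point [1,...,1]. As in any such sketch, the kernel form exp(-(x - c)^2 / (2 s^2)), the tie-break between modes, and the zero-scale handling are assumptions not stated in the text; the function names are hypothetical.

```python
import math
from statistics import multimode


def map_column(values, scale_fraction=0.5):
    """Hinge one dimension about its mode, then apply a Gaussian kernel."""
    mode = min(multimode(values))  # assumption: smallest mode wins a tie
    hinged = [v - mode if v > mode else 0 for v in values]
    center = min(multimode(hinged))
    scale = (max(hinged) - min(hinged)) * scale_fraction
    if scale == 0:  # constant column: every account looks normal here
        return [1.0] * len(hinged)
    return [math.exp(-((h - center) ** 2) / (2 * scale ** 2)) for h in hinged]


def map_to_hypercube(rows):
    """Map N rows of M raw counts to N points inside [0,1]^M."""
    columns = [map_column(list(col)) for col in zip(*rows)]
    return [list(point) for point in zip(*columns)]


def abnormal_rows(rows, second_preset_distance):
    """Row indices whose distance to [1,...,1] reaches the preset distance."""
    base = [1.0] * len(rows[0])
    mapped = map_to_hypercube(rows)
    return [i for i, p in enumerate(mapped)
            if math.dist(p, base) >= second_preset_distance]


# Columns: logins per minute, inquiries per minute (illustrative data).
rows = [[5, 2], [5, 2], [6, 2], [5, 3], [100, 40]]
flagged = abnormal_rows(rows, second_preset_distance=0.5)
```

  • Note that the largest possible distance to the base point grows with the square root of M, so a second preset distance chosen for one M may need adjusting for another.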
  • FIG. 3A is a schematic diagram of a one-dimensional space provided by an embodiment of the present application.
  • The single coordinate axis of the one-dimensional space represents the distribution value, which is the original data (that is, the counted number of logins) normalized to the interval 0-1.
  • The access data corresponding to multiple access types of the N accounts can be mapped to a multidimensional space.
  • For example, the number of times the N accounts have logged in to (visited) the target URL in 1 minute and the number of times the N accounts have made inquiries in 1 minute (for example, the number of times the insurance system website (or URL) corresponding to inquiries has been logged in to) are two dimensions that establish a two-dimensional space.
  • One axis in the two-dimensional space represents the number of logins to the target URL within 1 minute, and the other axis represents the number of inquiries within 1 minute.
  • The two-dimensional space may be as shown in FIG. 3B, for example. In FIG. 3B, the number of logins and the number of inquiries are both normalized to the 0-1 interval.
  • As another example, three dimensions, namely the number of logins to the target URL, the number of inquiries, and the number of retrievals made by the N accounts within 1 minute, are used to establish a three-dimensional space.
  • One axis in the three-dimensional space represents the number of logins to the target URL within 1 minute, a second axis represents the number of inquiries within 1 minute, and the third axis represents the number of retrievals within 1 minute.
  • The more access types are set, the more dimensions the mapped multidimensional space has.
  • The access data of the N accounts in different preset times may also be mapped into a multi-dimensional space.
  • For example, two dimensions, namely the number of times the N accounts have logged in to the target URL in 1 second and the number of times the N accounts have logged in to the target URL in 1 minute, can be used to establish a two-dimensional space.
  • One axis in this space represents the number of logins to the target URL in 1 second, and the other axis represents the number of logins to the target URL in 1 minute.
  • The two-dimensional space may be as shown in FIG. 3B, for example. In FIG. 3B, the number of logins to the target URL within 1 second and the number of logins to the target URL within 1 minute are both normalized to a range of 0-1.
  • As another example, three dimensions, namely the number of times the N accounts log in to the target URL in 1 second, in 1 minute, and in 1 hour, can be used to build a three-dimensional space.
  • One axis in this space represents the number of logins to the target URL within 1 second, a second axis represents the number of logins to the target URL within 1 minute, and the third axis represents the number of logins to the target URL within 1 hour.
  • The more preset times are set, the more dimensions the mapped multidimensional space has.
  • The access data of multiple access types in multiple preset times may also be mapped to a multidimensional space.
  • For example, four dimensions, namely the number of times the N accounts log in to the target URL in 1 second, the number of times the N accounts log in to the target URL in 1 minute, the number of inquiries made by the N accounts in 1 minute (for example, the number of logins to the insurance system website (or URL) corresponding to inquiries), and the number of inquiries made by the N accounts in 1 hour, are used to create a multidimensional space.
  • The multi-dimensional space is then four-dimensional.
  • One axis represents the number of logins to the target URL within 1 second, a second axis represents the number of logins to the target URL within 1 minute, a third axis represents the number of inquiries within 1 minute, and the fourth axis represents the number of inquiries within 1 hour.
  • Step S203 Determine the account corresponding to the abnormally distributed access data in the [0,1] interval as an abnormal account, where abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
  • After mapping, the numbers of times the N accounts have logged in to the target URL within 1 minute all lie within the range 0-1. The closer an account is to 0, the more abnormal it is; the closer it is to 1, the more normal. As shown in FIG. 3A, the access counts of most accounts are distributed around 1 (for example, most accounts have no more than 10,000 accesses and are mapped near 1), while the access counts of a few accounts are distributed near 0 (for example, a small number of accounts have more than 100,000 accesses and are mapped near 0). The accounts distributed near 0 can therefore be characterized as abnormal accounts.
  • After the access data is mapped, a multi-dimensional space is obtained, and for each dimension, points with abnormal distribution can be identified.
  • For example, suppose each account is an M-dimensional vector (the M-dimensional vector includes any number of the following: the number of logins to a key URL in one minute during working hours, the maximum number of logins to a key URL in one minute during non-working hours, the number of inquiries in one minute during working hours, the number of inquiries in one minute during non-working hours, the number of retrievals in one minute during working hours, the number of retrievals in one minute during non-working hours, the number of insurance tracking operations in one minute during working hours, and the number of insurance tracking operations in one minute during non-working hours). Then the N accounts have a total of N*M access data, and the N accounts are distributed in an M-dimensional space coordinate system. Anomalous points are determined according to the Euclidean distance between the target point and the space base point ([1,...,1]).
  • A threshold can be set, and if an account's Euclidean distance from the space base point exceeds the threshold, it is determined to be an abnormal account. As shown in FIG. 3B, assuming that most (more than half) of the accounts are distributed near (1,1,1,...,1) and a small number of accounts are distributed near (0,0,0,...,0), the accounts distributed near (0,0,0,...,0) are the abnormally accessing accounts. Alternatively, only the Euclidean distance of each account from the space base point (1,1,1,...,1) is considered: the larger the distance, the higher the anomaly score.
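  • Both readings of the rule above (a hard threshold, or a ranking by anomaly score) can be sketched in a few lines. The sketch assumes the points have already been mapped into the [0,1] hypercube and are keyed by account identifier; `anomaly_scores` and `rank_and_flag` are hypothetical names.

```python
import math


def anomaly_scores(mapped_points):
    """Score each account by its Euclidean distance to [1,...,1].

    mapped_points: dict of account id -> list of M coordinates in [0, 1]
    (assumed layout). A larger distance means a higher anomaly score.
    """
    return {account: math.dist(point, [1.0] * len(point))
            for account, point in mapped_points.items()}


def rank_and_flag(mapped_points, threshold):
    """Return (accounts sorted most-anomalous first, accounts over threshold)."""
    scores = anomaly_scores(mapped_points)
    ranked = sorted(scores, key=scores.get, reverse=True)
    flagged = [account for account in ranked if scores[account] > threshold]
    return ranked, flagged


points = {
    "a": [1.0, 1.0, 1.0],
    "b": [0.9, 1.0, 0.95],
    "c": [0.05, 0.1, 0.0],
}
ranked, flagged = rank_and_flag(points, threshold=1.0)
```

  • Returning the full ranking alongside the flagged list lets an operator review borderline accounts even when a threshold is in force.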
  • A certain type of access may also be subdivided along the time dimension. For example, for the login access type, you can count in turn the number of times the N accounts log in to a certain URL within 1 second, within 1 minute, within 1 day, and within 1 week, obtaining 4 sets of visit counts. The 4 sets of visit counts are used to establish a 4-dimensional coordinate space in which the N accounts are distributed, and the abnormal points in this 4-dimensional space are determined to be abnormally accessing accounts.
  • Each type of access data can be divided into multiple sets of data along the time dimension. A multi-dimensional space is then created from the sets of data obtained for all the access types, and abnormal points in the multi-dimensional space are confirmed as abnormal accounts.
  • For example, suppose each account is a 5-dimensional vector and data in two time periods (for example, within 1 second and within 1 minute) is selected for each feature; then 10 accounts have a total of 10*5*2 access data.
  • These 10 accounts are distributed in a 10-dimensional space coordinate system, and abnormal points are determined according to the Euclidean distance between the target point and the space base point (1,1,1,1,1,1,1,1,1,1).
  • The longer the distance, the greater the probability that the point is an abnormal point; the account corresponding to an abnormal point is an abnormal account.
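  • How features and time windows combine into dimensions (5 features × 2 windows = 10 dimensions in the example above) can be sketched as follows. The storage layout, a dict keyed by `(account, feature, window)`, and the placeholder feature names are assumptions made for illustration only.

```python
from itertools import product


def build_vectors(counts, features, windows):
    """Build one vector per account with len(features) * len(windows) entries.

    counts: dict mapping (account, feature, window) -> access count
    (assumed layout). Missing combinations are treated as 0 accesses.
    Returns (dimensions, vectors) where dimensions lists the
    (feature, window) pair behind each vector position.
    """
    dimensions = list(product(features, windows))
    accounts = sorted({account for account, _, _ in counts})
    vectors = {
        account: [counts.get((account, f, w), 0) for f, w in dimensions]
        for account in accounts
    }
    return dimensions, vectors


# Placeholder names: the text does not name the five features.
features = ["f1", "f2", "f3", "f4", "f5"]
windows = ["1s", "1min"]
counts = {
    ("acct1", "f1", "1s"): 3,
    ("acct1", "f1", "1min"): 40,
    ("acct2", "f2", "1min"): 7,
}
dimensions, vectors = build_vectors(counts, features, windows)
```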
  • After the abnormal account is determined, it can be added to the blacklist. The next time an access request initiated by the abnormal account is received, the access request is rejected.
  • The operator can focus on accounts that are suspected of being abnormal.
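  • The blacklist behaviour can be illustrated with a small in-memory sketch. The class name `AccessGate` and its methods are hypothetical; a real system would presumably persist the blacklist rather than keep it in memory.

```python
class AccessGate:
    """Reject access requests from blacklisted accounts (in-memory sketch)."""

    def __init__(self):
        self.blacklist = set()

    def block(self, account_id):
        """Add an account that was determined to be abnormal."""
        self.blacklist.add(account_id)

    def allow(self, account_id) -> bool:
        """Return False for blacklisted accounts, True otherwise."""
        return account_id not in self.blacklist


gate = AccessGate()
gate.block("account_42")
```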
  • The process of detecting abnormal accounts can be performed periodically, for example once a week or once a month.
  • The Gaussian kernel mapping can be performed on the latest data in the system. For example, if abnormal account detection is performed on January 8, the access data of each account during the week from January 1 to January 7 can be retrieved from the system. Performing the mapping and the abnormal account detection on the latest access data allows abnormal accounts in the recent period to be detected in time, so that theft or leakage of private data in the system can be prevented in time.
  • The embodiment of the present application does not need a manually set "warning line", nor does it need "abnormal" samples to be provided in advance, and it can automatically identify accounts with abnormal access behaviors.
  • This changes the traditional method of monitoring abnormal behaviors.
  • The Gaussian kernel algorithm is used to identify abnormal accounts automatically and in an unsupervised way. For a given kind of access data, statistics can be layered along different time dimensions (for example, statistics for one minute).
  • A feature is divided into multiple dimensions according to the time dimension, and then high-dimensional space mapping is performed to find abnormal accounts more accurately.
  • FIG. 4 shows a schematic structural diagram of an abnormal account detection device.
  • the abnormal account detection device 400 includes: an acquisition unit 401, a mapping unit 402 and a determination unit 403.
  • The obtaining unit 401 is configured to obtain N sets of access data for N accounts accessing the target website within a preset time, where each of the N sets of access data is M-dimensional data, and N and M are both positive integers;
  • the mapping unit 402 is configured to map the data of the N sets of access data in each dimension to the interval [0,1] by using a Gaussian kernel function;
  • the determining unit 403 is configured to determine the account corresponding to the abnormally distributed access data in the [0,1] interval as an abnormal account, where abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
  • When M is equal to 1, that is, when the access data is one-dimensional data, the mapping unit 402 is specifically configured to:
  • use a hinge function to perform a preliminary transformation on the N sets of access data: if the target access data in the N sets of access data is greater than the mode of the N access data, the value of the target access data is set to the difference between the target access data and the mode;
  • if the target access data is less than or equal to the mode of the N access data, the value of the target access data is set to 0.
  • The determining unit 403 is specifically configured to:
  • determine as an abnormal account the account whose access data lies at a distance from the value 0 in the one-dimensional space that is less than or equal to the first preset distance.
  • When M is greater than or equal to 2, that is, when the access data is multi-dimensional data, the mapping unit 402 is specifically configured to:
  • use a hinge function to perform a preliminary transformation of the N sets of access data in each dimension: if the value of the target access data in the target dimension is greater than the mode of the N access data in the target dimension, the value of the target access data in the target dimension is set to the difference between that value and the mode;
  • if the value of the target access data in the target dimension is less than or equal to the mode of the N access data in the target dimension, the value of the target access data in the target dimension is set to 0.
  • The determining unit 403 is specifically configured to:
  • determine as an abnormal account the account whose access data lies at a Euclidean distance from the space base point in the multi-dimensional space that is greater than or equal to the second preset distance, where the space base point is the point [1,...,1], which is M-dimensional.
  • the device 400 further includes:
  • an adding unit, configured to add the abnormal account to a blacklist after the determining unit 403 determines the account corresponding to the abnormally distributed access data in the interval [0,1] as the abnormal account, and to reject the next access request initiated by the abnormal account when that request is received.
  • the device 400 further includes:
  • an adding unit, configured to, after the determining unit 403 determines the account corresponding to the abnormally distributed access data in the interval [0,1] as the abnormal account, identify the Internet Protocol (IP) address or domain name of the abnormal account, set as abnormal accounts any other accounts whose IP address is on the same network segment as the IP address of the abnormal account or whose domain name is the same as the domain name of the abnormal account, and add them to the blacklist;
  • the preset time includes multiple preset times;
  • the M-dimensional data corresponding to a target account among the N accounts includes the access data of the target account accessing the target website within each of the multiple preset times.
  • the preset time includes a preset time corresponding to a working time period and/or a preset time corresponding to a non-working time period; the preset time includes one second, one minute, one hour, one day, or one week.
  • the access type includes inquiry, retrieval or insurance tracking.
  • the acquiring unit 401 is configured to access the access data of the N accounts of the target website within a preset time, including:
  • a computer non-volatile readable storage medium stores a computer program.
  • the computer program includes program instructions that, when executed by a processor, implement the abnormal account detection method.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer non-volatile readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer non-volatile readable storage medium, or transmitted from one computer non-volatile readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • the computer non-volatile readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), etc.

Abstract

The embodiments of the present application disclose a method, device, and computer storage medium for detecting an abnormal account, in the field of computer technology. The method comprises: obtaining N sets of access data generated when N accounts access a target URL within a preset time, each of the N sets of access data being M-dimensional data, and N and M both being positive integers; using a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1]; and determining, as an abnormal account, the account corresponding to abnormally distributed access data in the interval [0,1], the abnormally distributed access data being access data whose distribution differs from that of the access data of at least half of the N accounts. Implementing the embodiments of the present application allows abnormally accessing accounts to be detected accurately and efficiently.

Description

Method, device and computer storage medium for detecting abnormal account
This application claims priority to Chinese patent application No. 201910669346.3, filed with the Chinese Patent Office on July 23, 2019 and entitled "An abnormal account detection method, device and computer storage medium", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of Internet technology, and in particular to an abnormal account detection method, device, and computer storage medium.
Background
How to avoid leakage of a company's key data and ensure the company's information security has always been a major concern of information security departments. There are currently two traditional ways to prevent key company data from being stolen. The first is to manually set a warning line: if the number of times an account accesses the company's key uniform resource locators (URLs) exceeds this line, the account's permissions are revoked or the account is added to a blacklist. The limitation of this method is that it is difficult to set a "correct" warning line, so some borderline violating accounts go undetected. The second is machine learning: accounts are labeled "violating" or "non-violating", and a classifier is trained by supervised learning. The drawback of this approach is that, in practice, it is impossible to label clearly whether an account is "violating". In addition, accounts that truly violate the rules make up a very small proportion of all accounts, which severely degrades the effect of supervised learning. The prior-art methods for monitoring violating accounts are therefore ineffective and inaccurate, and cannot effectively find abnormally accessing accounts or violating accounts.
Summary of the Invention
The embodiments of the present application provide an abnormal account detection method, device, and computer storage medium that can detect abnormally accessing accounts accurately and efficiently.
An embodiment of the present application also provides an abnormal account detection method, including:
acquiring N sets of access data generated when N accounts access a target website within a preset time, where each of the N sets of access data is M-dimensional data, and N and M are both positive integers;
using a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1];
determining, as an abnormal account, the account corresponding to abnormally distributed access data in the interval [0,1], where the abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
An embodiment of the present application also provides an abnormal account detection device, including:
an acquiring unit, configured to acquire N sets of access data generated when N accounts access a target website within a preset time, where each of the N sets of access data is M-dimensional data, and N and M are both positive integers;
a mapping unit, configured to use a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1];
a determining unit, configured to determine, as an abnormal account, the account corresponding to abnormally distributed access data in the interval [0,1], where the abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
An embodiment of the present application also provides a computer device for executing the abnormal account detection method provided in the first aspect. The computer device may include a processor, a communication interface, and a memory, which are connected to one another. The communication interface is used to communicate with other network devices (such as terminals), the memory is used to store the implementation code of the above abnormal account detection method, and the processor is used to execute the program code stored in the memory, that is, to execute the above abnormal account detection method.
An embodiment of the present application also provides a computer non-volatile readable storage medium that stores instructions which, when run on a processor, cause the processor to execute the above abnormal account detection method.
An embodiment of the present application also provides a computer program product containing instructions which, when run on a processor, cause the processor to execute the above abnormal account detection method.
By implementing the embodiments of this application, accounts with abnormal access behavior can be identified automatically, without manually setting a "warning line" and without providing "abnormal" samples in advance. This changes the traditional way of monitoring abnormal behavior: a Gaussian kernel algorithm is used in an unsupervised manner to identify abnormal accounts automatically. Moreover, a given kind of access data can be layered along different time scales (for example, the number of accesses within one minute, within one hour, and within one day), so that one feature is split by time scale into features of multiple dimensions, each of which is mapped separately. This allows abnormal accounts to be found more precisely.
Additional aspects and advantages of this application will be given in part in the following description; they will become apparent from the description or be learned through practice of this application.
Description of the Drawings
The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of the hardware structure of a computer device provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of an abnormal account detection method provided by an embodiment of the present application;
FIG. 3A is a schematic diagram of a one-dimensional space provided by an embodiment of the present application;
FIG. 3B is a schematic diagram of a multi-dimensional space provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the logical structure of a computer device provided by an embodiment of the present application.
Detailed Description
The embodiments of the present application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present application and are not to be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the term "comprising" used in the specification of this application refers to the presence of the stated features, integers, steps, and operations, but does not exclude the presence or addition of one or more other features, integers, steps, or operations.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which this application belongs. It should also be understood that terms such as those defined in general-purpose dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, are not to be interpreted in an idealized or overly formal sense.
Those skilled in the art should understand that "application", "application program", "application software", and similar expressions used in this application denote the same concept, well known to those skilled in the art: computer software suitable for electronic operation, organically constructed from a series of computer instructions and related data resources. Unless otherwise specified, this naming is not restricted by the type or level of programming language, nor by the operating system or platform on which the software runs. Naturally, such concepts are not restricted to any form of terminal either.
First, the computer device involved in the embodiments of the present application is introduced. FIG. 1 shows a computer device provided by an embodiment of the present application. The computer device 100 may include a memory 101, a communication interface 102, and one or more processors 103. These components may be connected through a bus 104 or in other ways; FIG. 1 takes connection through a bus as an example. In the figure:
The memory 101 may be coupled with the processor 103 through the bus 104 or an input/output port, or may be integrated with the processor 103. The memory 101 is used to store various software programs and/or multiple sets of instructions. Specifically, the memory 101 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The memory 101 may also store a network communication program, which may be used to communicate with one or more additional devices, one or more terminals, and one or more network devices.
The processor 103 may be a general-purpose processor, such as a central processing unit (CPU); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The processor 103 can process data received through the communication interface 102.
The communication interface 102 is used by the computer device 100 to communicate with other network devices, such as a terminal. The communication interface 102 may be a transceiver, a transceiver circuit, or the like. "Communication interface" is a general term and may cover one or more interfaces, such as the interface between a terminal and a server. The communication interface 102 may include wired and wireless interfaces, such as a standard interface, Ethernet, or a multi-machine synchronization interface.
The processor 103 can be used to read and execute computer-readable instructions. Specifically, the processor 103 may be used to call data stored in the memory 101. Optionally, when the processor 103 sends any message or data, it does so by driving or controlling the communication interface 102. Optionally, when the processor 103 receives any message or data, it does so by driving or controlling the communication interface 102. The processor 103 can therefore be regarded as the control center for sending and receiving, and the communication interface 102 as the component that actually performs the sending and receiving operations.
In the embodiments of the present application, the communication interface 102 is specifically used to perform the data sending and receiving steps involved in the following method embodiments, and the processor 103 is specifically used to perform the data processing steps other than data sending and receiving.
In the embodiments of the present application, the computer device 100 may be a server, a terminal device, or another device with computing or processing capabilities.
Based on the structure of the computer device shown in FIG. 1, FIG. 2 shows an abnormal account detection method according to an embodiment of the present application. The method includes, but is not limited to, the following steps S201-S203.
Step S201: Acquire N sets of access data generated when N accounts access a target website within a preset time, where each of the N sets of access data is M-dimensional data, and N and M are both positive integers.
Optionally, the access data includes one or more of access time and number of accesses.
For example, for insurance products, the types of access to the target website may include, but are not limited to, one or more of login, inquiry, retrieval, and insurance tracking. The target website may be, for example, the website (or URL) of the insurance system used for login, for inquiry, for retrieval, or for insurance tracking. If the access type is login, the access data may be the access time and/or the number of accesses to a given website (or URL). If the access type is inquiry, the access data may be the access time and/or the number of accesses to a given inquiry website (or URL) for making inquiries. If the access type is retrieval, the access data may be the access time and/or the number of accesses to a given retrieval website (or URL) for performing retrievals. If the access type is insurance tracking, the access data may be the access time and/or the number of accesses to a given insurance tracking website (or URL) for tracking insurance applications.
The N accounts may be all accounts that access the target website. The preset time may be one second, one minute, one hour, one day, one week, and so on. A preset time for the working time period and a preset time for the non-working time period may also be set separately. For example, the working time period is 7:00-23:59 and the non-working time period is 00:00-6:59. The two preset times may differ; for example, the preset time for the working time period may be 1 second and that for the non-working time period 1 minute. There may be one preset time or several. For the same preset time, the statistical time periods of different accounts may be the same or different.
For a given access type, the access data of the N accounts within the preset time is counted. For example, if the access type is logging in to the target URL (such as the website of an insurance system), the access data (for example, the number of accesses) of the N accounts logging in to the target URL within the preset time (for example, one minute) is counted. The one-minute period used may differ from account to account: for each account, the minute in which it accessed the target URL most often during a day may be taken as its statistical time. For example, if account 1 accessed the target URL most often in the minute 10:00-10:01 of a day, the number of accesses by account 1 in that minute is used as the statistic for account 1. If account 2 accessed the target URL most often in the minute 11:00-11:01 of a day, the number of accesses by account 2 in that minute is used as the statistic for account 2. The access data of each account within the preset time can be obtained in the same way for other access types, for example, the number of inquiries or the number of retrievals made by each of the N accounts in one minute.
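The per-account statistic described above (the busiest preset-length period of each account) can be sketched as follows. This is only an illustrative sketch: the record layout `(account, timestamp)`, the function name `peak_count_per_window`, and the use of fixed, clock-aligned windows are assumptions that do not come from the application.

```python
from collections import Counter, defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def peak_count_per_window(records, window_seconds=60):
    """Return {account: largest number of accesses in any single fixed window}.

    `records` is an iterable of (account, timestamp) pairs (hypothetical layout).
    """
    buckets = defaultdict(Counter)
    for account, ts in records:
        # Index of the fixed-length window that contains this timestamp.
        bucket = int((ts - EPOCH).total_seconds()) // window_seconds
        buckets[account][bucket] += 1
    # For each account, keep only the count of its busiest window.
    return {account: max(counts.values()) for account, counts in buckets.items()}

records = [
    ("account1", datetime(2019, 7, 23, 10, 0, 5)),
    ("account1", datetime(2019, 7, 23, 10, 0, 40)),
    ("account1", datetime(2019, 7, 23, 10, 0, 59)),
    ("account1", datetime(2019, 7, 23, 15, 30, 0)),
    ("account2", datetime(2019, 7, 23, 11, 0, 10)),
    ("account2", datetime(2019, 7, 23, 11, 0, 20)),
]
peaks = peak_count_per_window(records)
```

Here account1's busiest minute is 10:00-10:01 and account2's is 11:00-11:01, so each account contributes the count from a different minute, as in the example above.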
For the same access type, the access data of each account may also be counted over multiple preset times, which may differ from one another. For example, one can count the number of times each of the N accounts logs in to the target URL within one second, within one minute, within one hour, and within one day.
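As a hedged illustration of how several preset times could yield M-dimensional data for one account, the sketch below builds a 4-dimensional vector from one account's login timestamps. The names `WINDOWS`, `peak_count`, and `feature_vector` are hypothetical, and using the busiest window for every time scale is an assumption; the application only gives that choice as an example for the one-minute case.

```python
from collections import Counter
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
WINDOWS = (1, 60, 3600, 86400)  # one second, one minute, one hour, one day

def peak_count(timestamps, window_seconds):
    """Largest number of accesses falling into any single fixed window."""
    counts = Counter(
        int((ts - EPOCH).total_seconds()) // window_seconds for ts in timestamps
    )
    return max(counts.values()) if counts else 0

def feature_vector(timestamps, windows=WINDOWS):
    """One dimension per preset time: M = len(windows)."""
    return [peak_count(timestamps, w) for w in windows]

logins = [
    datetime(2019, 7, 23, 10, 0, 5),
    datetime(2019, 7, 23, 10, 0, 5),
    datetime(2019, 7, 23, 10, 0, 40),
    datetime(2019, 7, 23, 10, 45, 0),
    datetime(2019, 7, 23, 18, 0, 0),
]
vector = feature_vector(logins)
```

Each account then corresponds to one such vector, and the N vectors together form the N sets of M-dimensional access data.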
Optionally, the preset time may differ between access types. For example, the preset time may be one second for login, one hour for inquiry, one day for insurance tracking, and so on.
The access data may be data stored in the system database, or may be compiled from the access requests received. A user can use an account to access the relevant URL links, and each access request from an account carries the corresponding access address. When the system receives an access request from an account, it records the access time, the access address, and the identifier of the account.
Step S202: Use a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1].
The Gaussian kernel is a nonlinear mathematical method for measuring similarity. By using a Gaussian kernel, the outliers in each feature dimension can be identified. Here, the feature dimensions include access time, number of accesses, inquiry-to-order ratio, retrieval-to-order ratio, insurance-tracking-to-order ratio, and so on.
In this application, a Gaussian kernel is used to classify the N accounts into two classes. Because the access data of the N accounts is not linearly separable, the Gaussian kernel is used to map the data set to a high-dimensional space, in which the data becomes linearly separable.
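For reference, a Gaussian kernel centered at a value c with scale parameter σ is commonly written in the form below. This is the standard textbook expression and is an assumption here, since the passage above does not spell out the formula; the symbols x (a feature value), c, and σ are illustrative names.

```latex
k(x) = \exp\!\left(-\frac{(x - c)^{2}}{2\sigma^{2}}\right), \qquad 0 < k(x) \le 1
```

Under this form, k(x) = 1 when x = c, and k(x) approaches 0 as x moves away from c, so every value is mapped into the interval [0,1].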
Optionally, the access data may be one-dimensional or multi-dimensional. When the access data is one-dimensional data, using a Gaussian kernel function to map the data of the N sets of access data in each dimension to the interval [0,1] includes:
using a hinge function to perform a preliminary transformation on the N sets of access data: if the target access data in the N sets of access data is greater than the mode of the N access data, setting the value of the target access data to the difference between the target access data and the mode;
if the target access data in the N sets of access data is less than or equal to the mode of the N access data, setting the value of the target access data to 0;
determining the center value of the Gaussian kernel transformation from the mode of the N sets of access data after the preliminary transformation;
determining the scale parameter of the Gaussian kernel transformation as the product of a first preset value and the difference between the maximum and minimum values of the N sets of access data after the preliminary transformation;
performing the Gaussian kernel transformation according to the center value and the scale parameter to obtain a one-dimensional space, in which the N sets of access data are all distributed between 0 and 1;
Determining the account corresponding to the abnormally distributed access data in the interval [0,1] as an abnormal account includes:
determining, as an abnormal account, the account corresponding to the access data whose distance from the value 0 in the one-dimensional space is less than or equal to a first preset distance.
Here, the mode is the value that occurs most frequently among the N access data. The first preset value may be, for example, one half or one third. Optionally, the first preset distance may be set manually or may be a device default.
That is, for the N access data of a given indicator, the mode (the most frequent value) is taken as the baseline. If an access datum is greater than the mode, its value is set to its difference from the mode; if it is less than or equal to the mode, its value is set to 0. The Gaussian kernel transformation is then applied. It involves two parameters, a center value and a scale parameter: the mode of the processed values is taken as the center value, and the difference between the maximum and minimum of the processed values, multiplied by one half (or one third), is taken as the scale parameter. Once these two parameters are obtained, the Gaussian kernel transformation can be performed. After the transformation, a one-dimensional space is obtained in which all values lie between 0 and 1; the farther a value is from the mode, the closer it is to 0 in the one-dimensional space. Outliers can therefore be judged by how close a target point is to 0: the closer to 0, the greater the probability that it is an outlier; the closer to 1, the smaller that probability.
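The one-dimensional procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: the kernel is taken to be the common form exp(-(x - c)^2 / (2σ^2)), which the text does not spell out; the function names, the sample counts, and the threshold 0.2 used as the first preset distance are all hypothetical.

```python
import math
from collections import Counter

def mode_of(values):
    """Most frequent value (ties resolved by first occurrence)."""
    return Counter(values).most_common(1)[0][0]

def hinge(values):
    """Preliminary transformation: value minus mode if above the mode, else 0."""
    m = mode_of(values)
    return [v - m if v > m else 0 for v in values]

def gaussian_kernel_1d(values, first_preset_value=0.5):
    """Map each access datum into [0,1]; values far from the mode approach 0."""
    transformed = hinge(values)
    center = mode_of(transformed)
    scale = (max(transformed) - min(transformed)) * first_preset_value
    if scale == 0:
        # All transformed values coincide, so nothing deviates from the center.
        return [1.0] * len(transformed)
    # Assumed kernel form: exp(-(x - c)^2 / (2 * sigma^2)).
    return [math.exp(-((t - center) ** 2) / (2 * scale ** 2)) for t in transformed]

def abnormal_indices(values, first_preset_distance=0.2, first_preset_value=0.5):
    """Indices whose mapped value lies within the preset distance of 0."""
    mapped = gaussian_kernel_1d(values, first_preset_value)
    return [i for i, k in enumerate(mapped) if k <= first_preset_distance]

counts = [3, 3, 3, 4, 3, 2, 3, 50]  # hypothetical accesses per account
mapped = gaussian_kernel_1d(counts)
flagged = abnormal_indices(counts)
```

With these sample counts, the account with 50 accesses is the only one mapped close to 0 and therefore the only one flagged, while accounts at or just above the mode stay near 1. Choosing one third instead of one half as the first preset value narrows the kernel and pushes distant values even closer to 0.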
When the access data is multi-dimensional, mapping the data of the N sets of access data in each dimension to the interval [0,1] using the Gaussian kernel function includes:
performing a preliminary transformation on the N sets of access data in each dimension using a hinge function: if the value, in a target dimension, of target access data among the N sets of access data is greater than the mode of the N access data in the target dimension, setting the value of the target access data in the target dimension to the difference between that value and the mode;
if the value, in the target dimension, of the target access data among the N sets of access data is less than or equal to the mode of the N access data in the target dimension, setting the value of the target access data in the target dimension to 0;
determining the center value of the Gaussian kernel transform according to the mode, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
determining the scale parameter of the Gaussian kernel transform according to the product of a second preset value and the difference between the maximum and minimum values, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
performing the Gaussian kernel transform according to the center value and the scale parameter to obtain a multi-dimensional space in which the data of the N sets of access data in each dimension are distributed between 0 and 1, where the multi-dimensional space is a hypercube formed by multiple [0,1] intervals, that is, the N sets of access data are all distributed between [0,…,0] and [1,…,1], and [0,…,0] and [1,…,1] are both M-dimensional data;
Determining the account corresponding to abnormally distributed access data in the [0,1] interval as an abnormal account includes:
determining, as an abnormal account, the account corresponding to access data whose Euclidean distance from a space base point in the multi-dimensional space is greater than or equal to a second preset distance, where the space base point is the point [1,…,1] and [1,…,1] is M-dimensional data.
Optionally, the second preset value may be, for example, one-half or one-third. Optionally, the second preset distance may be set manually or may be a device default.
For multi-dimensional data, after the multi-dimensional space has been constructed by treating every dimension as described above for one-dimensional data, the Euclidean distance between the target point and the space base point (that is, [1,…,1]) is calculated to determine the probability of an abnormal point. The distance serves as the criterion: the longer the distance, the greater the probability that the corresponding account is abnormal; the shorter the distance, the smaller that probability.
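The multi-dimensional case can be sketched in the same way: each dimension is mapped independently, and each account is then scored by its Euclidean distance from the base point [1,…,1]. As before, this is a sketch under assumptions: the kernel form exp(-(x - c)^2 / (2σ^2)), the function names, and the sample data are illustrative and do not come from the source.

```python
import math
from collections import Counter


def _mode(values):
    # Ties are resolved by first occurrence (an assumption).
    return Counter(values).most_common(1)[0][0]


def _map_dimension(column, scale_factor):
    """Hinge step followed by an assumed Gaussian kernel, for one dimension."""
    baseline = _mode(column)
    hinged = [max(v - baseline, 0) for v in column]
    center = _mode(hinged)
    sigma = (max(hinged) - min(hinged)) * scale_factor
    if sigma == 0:
        return [1.0] * len(hinged)
    return [math.exp(-((h - center) ** 2) / (2 * sigma ** 2)) for h in hinged]


def map_to_hypercube(rows, scale_factor=0.5):
    """rows: one M-dimensional tuple per account. Returns points in [0,1]^M."""
    columns = [list(c) for c in zip(*rows)]
    mapped = [_map_dimension(c, scale_factor) for c in columns]
    return [list(p) for p in zip(*mapped)]


def distance_to_base_point(point):
    """Euclidean distance from the space base point [1, ..., 1]."""
    return math.sqrt(sum((1.0 - x) ** 2 for x in point))


def abnormal_accounts(rows, second_preset_distance, scale_factor=0.5):
    """Indices of accounts at least the preset distance away from [1, ..., 1]."""
    points = map_to_hypercube(rows, scale_factor)
    return [i for i, p in enumerate(points)
            if distance_to_base_point(p) >= second_preset_distance]


# Hypothetical (logins per minute, inquiries per minute) for five accounts.
rows = [(5, 2), (5, 2), (6, 2), (5, 3), (200, 90)]
print(abnormal_accounts(rows, second_preset_distance=1.0))
```

Because each coordinate lies in [0,1], the distance from the base point is bounded by the square root of M, so a fixed second preset distance needs to be chosen with the number of dimensions in mind.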
Taking the number of accesses as an example, the number of times each of N accounts accesses a certain URL within 1 minute is obtained, and the access counts of the N accounts are processed with a Gaussian kernel to obtain a one-dimensional space. FIG. 3A is a schematic diagram of a one-dimensional space provided by an embodiment of this application. One coordinate axis of the one-dimensional space represents the distribution value, which normalizes the original data (that is, the counted number of logins) to the interval 0-1.
For multiple access types, the access data of N accounts for those access types can be mapped to a multi-dimensional space. For example, a two-dimensional space can be built from two dimensions: the number of times each of the N accounts logs in to (accesses) the target URL within 1 minute, and the number of price inquiries each of the N accounts makes within 1 minute (for example, the number of logins to the insurance system website (or URL) corresponding to price inquiries). One coordinate axis of the two-dimensional space represents the number of logins to the target URL within 1 minute, and the other represents the number of price inquiries within 1 minute. The two-dimensional space may be, for example, as shown in FIG. 3B, in which both the number of logins and the number of price inquiries are normalized to the interval 0-1. As another example, a three-dimensional space can be built from three dimensions: the number of logins to the target URL within 1 minute, the number of price inquiries within 1 minute (for example, the number of logins to the insurance system website (or URL) corresponding to price inquiries), and the number of retrievals within 1 minute (for example, the number of logins to the insurance system website (or URL) corresponding to retrieval). One coordinate axis of the three-dimensional space represents the number of logins to the target URL within 1 minute, another the number of price inquiries within 1 minute, and the third the number of retrievals within 1 minute. The more access types are configured, the more dimensions the mapped multi-dimensional space has.
Optionally, for access data of the same access type over different preset times, the access data of the N accounts over those preset times can also be mapped to a multi-dimensional space. For example, a multi-dimensional space can be built from two dimensions: the number of logins to the target URL within 1 second and the number of logins to the target URL within 1 minute. One coordinate axis represents the number of logins to the target URL within 1 second, and the other represents the number of logins within 1 minute. The two-dimensional space may be, for example, as shown in FIG. 3B, in which the number of logins within 1 second and the number of logins within 1 minute are both normalized to the interval 0-1. As another example, a multi-dimensional space can be built from three dimensions: the number of logins to the target URL within 1 second, within 1 minute, and within 1 hour. One coordinate axis represents the number of logins within 1 second, another the number within 1 minute, and the third the number within 1 hour. The more preset times are configured, the more dimensions the mapped multi-dimensional space has.
Optionally, the access data of multiple access types over multiple preset times can also be mapped to a multi-dimensional space. For example, a multi-dimensional space can be built from four dimensions: the number of logins to the target URL within 1 second, the number of logins to the target URL within 1 minute, the number of price inquiries within 1 minute (for example, the number of logins to the insurance system website (or URL) corresponding to price inquiries), and the number of price inquiries within 1 hour. This multi-dimensional space is four-dimensional: one coordinate axis represents the number of logins to the target URL within 1 second, another the number of logins within 1 minute, another the number of price inquiries within 1 minute, and the fourth the number of price inquiries within 1 hour.
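To make the construction of the M-dimensional vectors concrete, the following sketch counts events per account for each configured (access type, time window) pair. The event tuple layout, the function name, and the choice to measure every window backwards from a common end time are all assumptions made for illustration; the source does not prescribe how the counts are collected.

```python
from collections import defaultdict


def build_feature_rows(events, features, end_time):
    """Count events per account for each (access_type, window_seconds) feature.

    events:   iterable of (account, access_type, timestamp_seconds)
    features: list of (access_type, window_seconds); one dimension per entry
    end_time: windows are taken as (end_time - window, end_time] (an assumption)
    Returns a dict mapping account -> list of M counts.
    """
    rows = defaultdict(lambda: [0] * len(features))
    for account, access_type, ts in events:
        row = rows[account]  # ensures every seen account gets a row
        for j, (ftype, window) in enumerate(features):
            if access_type == ftype and end_time - window < ts <= end_time:
                row[j] += 1
    return dict(rows)


# Hypothetical log: (account, access type, timestamp in seconds).
events = [
    ("a", "login", 95), ("a", "login", 50), ("a", "inquiry", 99),
    ("b", "login", 99.5),
]
# Three dimensions: logins in 1 s, logins in 1 min, inquiries in 1 min.
features = [("login", 1), ("login", 60), ("inquiry", 60)]
print(build_feature_rows(events, features, end_time=100))
```

Adding another access type or another window only appends an entry to `features`, which matches the observation that more access types or preset times yield more dimensions.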
Step S203: The account corresponding to abnormally distributed access data in the [0,1] interval is determined to be an abnormal account, where abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
As shown in FIG. 3A, the number of times each of the N accounts logs in to the target URL within 1 minute is mapped into the interval 0-1. The closer an account is to 0, the more abnormal it is; the closer to 1, the more normal. FIG. 3A shows that the access counts of the vast majority of accounts are distributed near 1 (for example, most accounts make no more than 10,000 accesses and lie near 1), while those of a small number of accounts are distributed near 0 (for example, a few accounts make more than 100,000 accesses and lie near 0). The accounts distributed near 0 can therefore be characterized as abnormal accounts.
Similarly, for multi-dimensional access data, a multi-dimensional space is obtained, and abnormally distributed points can be identified in each dimension. For example, suppose each account is an M-dimensional vector whose components are any several of: the number of logins to a key URL in one minute during working hours, the maximum number of logins to a key URL in one minute during non-working hours, the number of price inquiries in one minute during working hours, the number of price inquiries in one minute during non-working hours, the number of retrievals in one minute during working hours, the number of retrievals in one minute during non-working hours, the number of insurance application tracking operations in one minute during working hours, and the number of insurance application tracking operations in one minute during non-working hours. The N accounts then have N*M access data in total and are distributed in an M-dimensional coordinate system. Abnormal points are determined from the Euclidean distance between the target point and the space base point [1,…,1]: the greater the Euclidean distance from the base point (1,1,1,…,1), the greater the probability that the point is abnormal; the smaller the distance, the smaller that probability. A threshold can be set, and an account whose Euclidean distance from the space base point exceeds the threshold is determined to be an abnormal account. As shown in FIG. 3B, suppose most (more than half) of the accounts are distributed near (1,1,1,…,1) and a small number are distributed near (0,0,0,…,0); the accounts near (0,0,0,…,0) are the accounts with abnormal access. Alternatively, only the Euclidean distance of each account from the space base point (1,1,1,…,1) is considered, and the greater the distance, the higher the anomaly score.
In an optional embodiment, a given access type can also be subdivided along the time dimension. For example, for the access type "number of logins", the number of times each of the N accounts logs in to a certain URL within 1 second, within 1 minute, within 1 day, and within 1 week can be counted in turn, giving 4 sets of access counts. These 4 sets are used to establish 4-dimensional coordinates, yielding a multi-dimensional space for the N accounts, and the abnormally distributed points in the 4-dimensional coordinates are determined to be accounts with abnormal access. Each kind of access data can be divided into multiple sets of data along the time dimension, a multi-dimensional space is then created from the sets obtained for all kinds of access data, and the abnormally distributed points in that space are confirmed to be abnormal accounts. For example, if each account is a 5-dimensional vector and data from two time periods (for example, within 1 second and within 1 minute) is selected for each feature, then 10 accounts have 10*5*2 access data in total and are distributed in a 10-dimensional coordinate system. The abnormal accounts corresponding to abnormal points are determined from the Euclidean distance between the target point and the space base point (1,1,1,1,1,1,1,1,1,1): the greater the distance, the greater the probability that the point is abnormal.
After an abnormal account is determined, it can be added to a blacklist. The next time an access request initiated by the abnormal account is received, the request is rejected.
Optionally, after an abnormal account is detected, its IP address or domain name is identified, and other accounts whose IP address is in the same network segment as that of the abnormal account, or whose domain name is the same as that of the abnormal account, are set as abnormal accounts and added to the blacklist. Alternatively, such accounts are set as suspected abnormal accounts, the suspected abnormal accounts are tallied, and a table is generated and output so that operators can pay particular attention to them.
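The association step can be illustrated with Python's standard `ipaddress` module. The sketch below treats "same network segment" as "same /24 network", which is an assumption; the source does not define the segment size. The account table layout and function names are likewise hypothetical.

```python
import ipaddress


def same_segment(ip_a, ip_b, prefix_len=24):
    """True if ip_b falls in the network of ip_a with the given prefix length."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network


def related_accounts(abnormal, account_info, prefix_len=24):
    """Accounts sharing a network segment or a domain with any abnormal account.

    abnormal:     set of account ids already determined to be abnormal
    account_info: dict mapping account id -> (ip_address, domain)
    """
    related = set()
    for bad in abnormal:
        bad_ip, bad_domain = account_info[bad]
        for account, (ip, domain) in account_info.items():
            if account in abnormal:
                continue
            if same_segment(bad_ip, ip, prefix_len) or domain == bad_domain:
                related.add(account)
    return related


# Hypothetical account table.
info = {
    "u1": ("10.0.0.5", "a.example"),
    "u2": ("10.0.0.77", "b.example"),   # same /24 as u1
    "u3": ("10.0.1.9", "a.example"),    # same domain as u1
    "u4": ("192.168.1.1", "c.example"),
}
print(sorted(related_accounts({"u1"}, info)))
```

Whether the returned accounts are blacklisted directly or only reported as suspected abnormal accounts corresponds to the two alternatives described above.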
In a specific implementation, abnormal account detection can be performed periodically, for example once a week or once a month. Each time detection is performed, the Gaussian mapping can be computed from the latest data in the system. For example, if detection is performed on January 8, the access data of each account for the week of January 1 to January 7 can be retrieved from the system. Performing the Gaussian mapping and abnormal account detection on the latest access data makes it possible to detect recently abnormal accounts promptly and to prevent private data in the system from being stolen or leaked.
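As a small illustration of the periodic schedule, the following sketch computes the window of data to analyze for a given run date. The function name and the convention that the window ends on the day before the run are assumptions chosen to match the January 8 example; the year used below is arbitrary.

```python
from datetime import date, timedelta


def detection_window(run_date, days=7):
    """Return (first_day, last_day) of the period preceding run_date."""
    return run_date - timedelta(days=days), run_date - timedelta(days=1)


start, end = detection_window(date(2021, 1, 8))
print(start, end)
```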
Compared with the prior art, the embodiments of this application require neither a manually set "warning line" nor "abnormal" samples provided in advance, and can automatically identify accounts whose access behavior is abnormal. This changes the traditional way of monitoring abnormal behavior: a Gaussian kernel algorithm is used in an unsupervised manner to automatically identify abnormally distributed accounts. In addition, a given kind of access data can be layered along different time dimensions (for example, the number of accesses within one minute, within one hour, and within one day), so that one feature is divided by time into features of multiple dimensions that are each mapped into a high-dimensional space, which allows abnormal accounts to be found more precisely.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of an abnormal account detection device. As shown in FIG. 4, the abnormal account detection device 400 includes an acquisition unit 401, a mapping unit 402, and a determination unit 403.
The acquisition unit 401 is configured to acquire N sets of access data of N accounts accessing a target website within a preset time, where each of the N sets of access data is M-dimensional data and N and M are both positive integers;
the mapping unit 402 is configured to map the data of the N sets of access data in each dimension to the interval [0,1] using a Gaussian kernel function;
the determination unit 403 is configured to determine the account corresponding to abnormally distributed access data in the [0,1] interval as an abnormal account, where abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
In a possible design, M is equal to 1, that is, the access data is one-dimensional; the mapping unit 402 is specifically configured to:
perform a preliminary transformation on the N sets of access data using a hinge function: if target access data among the N sets of access data is greater than the mode of the N access data, set the value of the target access data to the difference between the target access data and the mode;
if the target access data among the N sets of access data is less than or equal to the mode of the N access data, set the value of the target access data to 0;
determine the center value of the Gaussian kernel transform according to the mode of the N sets of access data after the preliminary transformation;
determine the scale parameter of the Gaussian kernel transform according to the product of a first preset value and the difference between the maximum and minimum values of the N sets of access data after the preliminary transformation;
perform the Gaussian kernel transform according to the center value and the scale parameter to obtain a one-dimensional space in which the N sets of access data are all distributed between 0 and 1;
The determination unit 403 is specifically configured to:
determine, as an abnormal account, the account corresponding to access data whose distance from the value 0 in the one-dimensional space is less than or equal to a first preset distance.
In a possible design, M is greater than or equal to 2, that is, the access data is multi-dimensional; the mapping unit 402 is specifically configured to:
perform a preliminary transformation on the N sets of access data in each dimension using a hinge function: if the value, in a target dimension, of target access data among the N sets of access data is greater than the mode of the N access data in the target dimension, set the value of the target access data in the target dimension to the difference between that value and the mode;
if the value, in the target dimension, of the target access data among the N sets of access data is less than or equal to the mode of the N access data in the target dimension, set the value of the target access data in the target dimension to 0;
determine the center value of the Gaussian kernel transform according to the mode, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
determine the scale parameter of the Gaussian kernel transform according to the product of a second preset value and the difference between the maximum and minimum values, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
perform the Gaussian kernel transform according to the center value and the scale parameter to obtain a multi-dimensional space in which the data of the N sets of access data in each dimension are distributed between 0 and 1;
The determination unit 403 is specifically configured to:
determine, as an abnormal account, the account corresponding to access data whose Euclidean distance from a space base point in the multi-dimensional space is greater than or equal to a second preset distance, where the space base point is the point [1,…,1] and [1,…,1] is M-dimensional data.
In a possible design, the device 400 further includes:
an adding unit, configured to, after the determination unit 403 determines the account corresponding to abnormally distributed access data in the [0,1] interval as an abnormal account, add the abnormal account to a blacklist and, the next time an access request initiated by the abnormal account is received, reject the access request.
In a possible design, the device 400 further includes:
an adding unit, configured to, after the determination unit 403 determines the account corresponding to abnormally distributed access data in the [0,1] interval as an abnormal account, identify the Internet Protocol (IP) address or domain name of the abnormal account, and set other accounts whose IP address is in the same network segment as that of the abnormal account, or whose domain name is the same as that of the abnormal account, as abnormal accounts and add them to the blacklist;
or, set other accounts whose IP address is in the same network segment as that of the abnormal account, or whose domain name is the same as that of the abnormal account, as suspected abnormal accounts, tally the suspected abnormal accounts, and generate and output a table.
In a possible design, there are multiple preset times, and the M-dimensional data corresponding to a target account among the N accounts includes the access data of the target account accessing the target website within the multiple preset times.
In a possible design, the preset time includes a preset time corresponding to a working period and/or a preset time corresponding to a non-working period; the preset time includes any one of one second, one minute, one hour, one day, or one week.
In a possible design, the access type includes price inquiry, retrieval, or insurance application tracking.
In a possible design, the acquisition unit 401 being configured to acquire the access data of N accounts accessing the target website within a preset time includes:
acquiring, at a preset period, the access data of N accounts accessing the target website within the preset time.
It should be noted that, for the specific implementation of the abnormal account detection device 400, reference may be made to the description of the method embodiment shown in FIG. 2, which is not repeated here.
Another embodiment of this application provides a computer non-volatile readable storage medium that stores a computer program, the computer program including program instructions that are implemented when executed by a processor.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer non-volatile readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer non-volatile readable storage medium, or transmitted from one computer non-volatile readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer non-volatile readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
The specific implementations described above further explain in detail the purpose, technical solutions, and beneficial effects of the embodiments of this application. It should be understood that the above are merely specific implementations of the embodiments of this application and are not intended to limit their scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the embodiments of this application shall fall within the scope of protection of the embodiments of this application.

Claims (20)

  1. An abnormal account detection method, characterized by comprising:
    acquiring N sets of access data generated by N accounts accessing a target website within a preset time, wherein each of the N sets of access data is M-dimensional data, and N and M are both positive integers;
    mapping the data of the N sets of access data in each dimension to the interval [0,1] by using a Gaussian kernel function; and
    determining an account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account, wherein the abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
  2. The method according to claim 1, characterized in that M is equal to 1, and the mapping of the data of the N sets of access data in each dimension to the interval [0,1] by using a Gaussian kernel function comprises:
    performing a preliminary transformation on the N sets of access data by using a hinge function: if target access data among the N sets of access data is greater than the mode of the N sets of access data, setting the value of the target access data to the difference between the target access data and the mode;
    if the target access data among the N sets of access data is less than or equal to the mode of the N sets of access data, setting the value of the target access data to 0;
    determining a center value of the Gaussian kernel transformation according to the mode of the N sets of access data after the preliminary transformation;
    determining a scale parameter of the Gaussian kernel transformation according to the product of a first preset value and the difference between the maximum value and the minimum value of the N sets of access data after the preliminary transformation;
    performing Gaussian kernel transformation according to the center value and the scale parameter to obtain a one-dimensional space, in which the N sets of access data all lie between 0 and 1;
    and the determining of the account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account comprises:
    determining, as an abnormal account, an account corresponding to access data whose distance from the value 0 in the one-dimensional space is less than or equal to a first preset distance.
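The one-dimensional procedure of claim 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the claim does not give the kernel formula, so the common form exp(-(x - c)^2 / (2 s^2)) is assumed, and the names `scale_factor` (the "first preset value") and `max_distance` (the "first preset distance"), along with their defaults, are hypothetical.

```python
import math
from collections import Counter


def mode(values):
    # Most frequent value; ties resolve to the value seen first.
    return Counter(values).most_common(1)[0][0]


def hinge(values):
    # Preliminary transformation: values above the mode keep their excess
    # over the mode, everything else becomes 0.
    m = mode(values)
    return [max(v - m, 0) for v in values]


def gaussian_kernel_1d(values, scale_factor=0.25):
    # scale_factor stands in for the "first preset value" (assumed default).
    hinged = hinge(values)
    center = mode(hinged)
    sigma = (max(hinged) - min(hinged)) * scale_factor
    if sigma == 0:
        # All accounts behave identically, so none deviates from the center.
        return [1.0] * len(hinged)
    # Assumed kernel form: exp(-(x - c)^2 / (2 * sigma^2)), which lies in (0, 1].
    return [math.exp(-((v - center) ** 2) / (2 * sigma ** 2)) for v in hinged]


def abnormal_accounts_1d(visits, scale_factor=0.25, max_distance=0.1):
    # visits maps account -> access count in the preset time (M = 1).
    accounts = list(visits)
    mapped = gaussian_kernel_1d([visits[a] for a in accounts], scale_factor)
    # Mapped values near 0 are far from the center, hence abnormal.
    return [a for a, g in zip(accounts, mapped) if g <= max_distance]


visits = {"a": 10, "b": 10, "c": 12, "d": 10, "e": 500}
print(abnormal_accounts_1d(visits))  # ['e']
```

Because the hinge step zeroes every value at or below the mode, only unusually high access counts can end up near 0 after the kernel mapping, which is why the claim tests distance from 0 rather than from 1.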
  3. The method according to claim 1, characterized in that M is greater than or equal to 2, and the mapping of the data of the N sets of access data in each dimension to the interval [0,1] by using a Gaussian kernel function comprises:
    performing a preliminary transformation on the N sets of access data in each dimension by using a hinge function: if the value of target access data among the N sets of access data in a target dimension is greater than the mode of the N sets of access data in the target dimension, setting the value of the target access data in the target dimension to the difference between that value and the mode;
    if the value of the target access data in the target dimension is less than or equal to the mode of the N sets of access data in the target dimension, setting the value of the target access data in the target dimension to 0;
    determining a center value of the Gaussian kernel transformation according to the mode, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
    determining a scale parameter of the Gaussian kernel transformation according to the product of a second preset value and the difference between the maximum value and the minimum value, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
    performing Gaussian kernel transformation according to the center value and the scale parameter to obtain a multi-dimensional space, in which the data of the N sets of access data in each dimension all lie between 0 and 1;
    and the determining of the account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account comprises:
    determining, as an abnormal account, an account corresponding to access data whose Euclidean distance from a space base point in the multi-dimensional space is greater than or equal to a second preset distance, wherein the space base point is the point [1,…,1], and [1,…,1] is M-dimensional data.
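For M greater than or equal to 2, the same mapping is applied dimension by dimension and accounts are ranked by Euclidean distance from the base point [1,…,1]. The sketch below is illustrative only: the kernel form is assumed as in the usual Gaussian definition, and `scale_factor` (the "second preset value") and `min_distance` (the "second preset distance") are hypothetical names with assumed defaults. It uses `math.dist`, available from Python 3.8.

```python
import math
from collections import Counter


def map_column(column, scale_factor):
    # Hinge transform and Gaussian kernel mapping for a single dimension.
    m = Counter(column).most_common(1)[0][0]
    hinged = [max(v - m, 0) for v in column]
    center = Counter(hinged).most_common(1)[0][0]
    sigma = (max(hinged) - min(hinged)) * scale_factor
    if sigma == 0:
        return [1.0] * len(hinged)
    # Assumed kernel form: exp(-(x - c)^2 / (2 * sigma^2)).
    return [math.exp(-((v - center) ** 2) / (2 * sigma ** 2)) for v in hinged]


def abnormal_accounts_md(rows, scale_factor=0.25, min_distance=0.9):
    # rows maps account -> tuple of M access values (one per dimension).
    accounts = list(rows)
    columns = zip(*(rows[a] for a in accounts))
    mapped = [map_column(list(col), scale_factor) for col in columns]
    base = [1.0] * len(mapped)  # the space base point [1, ..., 1]
    flagged = []
    for i, account in enumerate(accounts):
        point = [col[i] for col in mapped]
        # Normal accounts sit near the base point; abnormal ones are far away.
        if math.dist(point, base) >= min_distance:
            flagged.append(account)
    return flagged


rows = {
    "a": (10, 3), "b": (10, 3), "c": (11, 3),
    "d": (10, 4), "e": (400, 3), "f": (10, 90),
}
print(abnormal_accounts_md(rows))  # ['e', 'f']
```

An account that is extreme in only one dimension still moves almost a full unit away from the base point, so it can be flagged even when its other dimensions look ordinary.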
  4. The method according to any one of claims 1 to 3, characterized in that there are a plurality of preset times, and the M-dimensional data corresponding to a target account among the N accounts comprises the access data of the target account accessing the target website within the plurality of preset times.
  5. The method according to claim 4, characterized in that the acquiring of N sets of access data generated by N accounts accessing a target website within a preset time comprises:
    acquiring the access data of the N accounts accessing the target website within the plurality of preset times.
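One way to read claims 4 and 5 is that each preset time contributes one dimension of an account's M-dimensional data. The following sketch assumes a hypothetical access log of `(account, timestamp)` pairs and half-open time windows; neither the log format nor the function name comes from the source.

```python
from collections import defaultdict


def build_access_vectors(events, windows):
    # events: iterable of (account, timestamp) pairs (hypothetical log format).
    # windows: list of (start, end) half-open intervals, one per preset time.
    counts = defaultdict(lambda: [0] * len(windows))
    for account, ts in events:
        for i, (start, end) in enumerate(windows):
            if start <= ts < end:
                counts[account][i] += 1
    # Each account ends up with an M-dimensional tuple, M = len(windows).
    return {account: tuple(v) for account, v in counts.items()}


events = [("a", 5), ("a", 65), ("b", 10), ("b", 20), ("a", 70)]
windows = [(0, 60), (60, 120)]
print(build_access_vectors(events, windows))  # {'a': (1, 2), 'b': (2, 0)}
```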
  6. The method according to any one of claims 1 to 3, characterized in that, after the determining of the account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account, the method further comprises:
    adding the abnormal account to a blacklist, and rejecting an access request initiated by the abnormal account the next time such a request is received.
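The blacklist step of claim 6 amounts to a membership check before a request is served. A minimal sketch follows; the class and method names are hypothetical, and a real deployment would presumably persist the list rather than keep it in memory.

```python
class AccessGate:
    """Keeps a blacklist of abnormal accounts and rejects their later requests."""

    def __init__(self):
        self._blacklist = set()

    def block(self, account):
        # Called once an account has been determined to be abnormal.
        self._blacklist.add(account)

    def handle_request(self, account):
        # Returns True if the request may proceed, False if it is rejected.
        return account not in self._blacklist


gate = AccessGate()
gate.block("e")
print(gate.handle_request("e"))  # False
print(gate.handle_request("a"))  # True
```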
  7. The method according to any one of claims 1 to 3, characterized in that, after the determining of the account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account, the method further comprises:
    identifying the Internet Protocol (IP) address or domain name of the abnormal account, setting other accounts whose IP address is in the same network segment as the IP address of the abnormal account, or whose domain name is the same as the domain name of the abnormal account, as abnormal accounts, and adding them to a blacklist;
    or, setting other accounts whose IP address is in the same network segment as the IP address of the abnormal account, or whose domain name is the same as the domain name of the abnormal account, as suspected abnormal accounts, compiling statistics on the suspected abnormal accounts, and generating a table for output.
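The "same network segment" test of claim 7 can be sketched with Python's standard `ipaddress` module. The claim does not say how wide a segment is, so the `/24` default below is an assumption, and the function name and the `account_ips` mapping are hypothetical.

```python
import ipaddress


def related_accounts(abnormal, account_ips, prefix_len=24):
    # abnormal: set of accounts already determined to be abnormal.
    # account_ips: mapping account -> IP address string (hypothetical input).
    # prefix_len: assumed width of a "network segment"; the source gives none.
    networks = {
        ipaddress.ip_network(f"{account_ips[a]}/{prefix_len}", strict=False)
        for a in abnormal
        if a in account_ips
    }
    related = []
    for account, ip in account_ips.items():
        if account in abnormal:
            continue
        addr = ipaddress.ip_address(ip)
        # An account is related if it shares a segment with any abnormal account.
        if any(addr in net for net in networks):
            related.append(account)
    return related


account_ips = {"e": "203.0.113.7", "x": "203.0.113.200", "y": "198.51.100.4"}
print(related_accounts({"e"}, account_ips))  # ['x']
```

The returned accounts can then either be blacklisted directly or reported as suspected abnormal accounts, matching the two alternatives in the claim.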
  8. The method according to any one of claims 1 to 3, characterized in that the preset time comprises a preset time corresponding to a working period and/or a preset time corresponding to a non-working period, and the preset time is any one of one second, one minute, one hour, one day, or one week.
  9. The method according to any one of claims 1 to 3, characterized in that the acquiring of N sets of access data generated by N accounts accessing a target website within a preset time comprises:
    acquiring, at a preset period, the access data of the N accounts accessing the target website within the preset time.
  10. An abnormal account detection apparatus, characterized by comprising:
    an acquiring unit, configured to acquire N sets of access data generated by N accounts accessing a target website within a preset time, wherein each of the N sets of access data is M-dimensional data, and N and M are both positive integers;
    a mapping unit, configured to map the data of the N sets of access data in each dimension to the interval [0,1] by using a Gaussian kernel function; and
    a determining unit, configured to determine an account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account, wherein the abnormally distributed access data is access data whose distribution differs from that of the access data of at least half of the N accounts.
  11. The apparatus according to claim 10, characterized in that M is equal to 1, and the mapping unit is specifically configured to:
    perform a preliminary transformation on the N sets of access data by using a hinge function: if target access data among the N sets of access data is greater than the mode of the N sets of access data, set the value of the target access data to the difference between the target access data and the mode;
    if the target access data among the N sets of access data is less than or equal to the mode of the N sets of access data, set the value of the target access data to 0;
    determine a center value of the Gaussian kernel transformation according to the mode of the N sets of access data after the preliminary transformation;
    determine a scale parameter of the Gaussian kernel transformation according to the product of a first preset value and the difference between the maximum value and the minimum value of the N sets of access data after the preliminary transformation;
    perform Gaussian kernel transformation according to the center value and the scale parameter to obtain a one-dimensional space, in which the N sets of access data all lie between 0 and 1;
    and the determining unit is specifically configured to:
    determine, as an abnormal account, an account corresponding to access data whose distance from the value 0 in the one-dimensional space is less than or equal to a first preset distance.
  12. The apparatus according to claim 10, characterized in that M is greater than or equal to 2, and the mapping unit is specifically configured to:
    perform a preliminary transformation on the N sets of access data in each dimension by using a hinge function: if the value of target access data among the N sets of access data in a target dimension is greater than the mode of the N sets of access data in the target dimension, set the value of the target access data in the target dimension to the difference between that value and the mode;
    if the value of the target access data in the target dimension is less than or equal to the mode of the N sets of access data in the target dimension, set the value of the target access data in the target dimension to 0;
    determine a center value of the Gaussian kernel transformation according to the mode, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
    determine a scale parameter of the Gaussian kernel transformation according to the product of a second preset value and the difference between the maximum value and the minimum value, in the target dimension, of the N sets of access data after the values in every dimension have been changed;
    perform Gaussian kernel transformation according to the center value and the scale parameter to obtain a multi-dimensional space, in which the data of the N sets of access data in each dimension all lie between 0 and 1;
    and the determining unit is specifically configured to:
    determine, as an abnormal account, an account corresponding to access data whose Euclidean distance from a space base point in the multi-dimensional space is greater than or equal to a second preset distance, wherein the space base point is the point [1,…,1], and [1,…,1] is M-dimensional data.
  13. The apparatus according to any one of claims 10 to 12, characterized in that there are a plurality of preset times, and the M-dimensional data corresponding to a target account among the N accounts comprises the access data of the target account accessing the target website within the plurality of preset times.
  14. The apparatus according to claim 13, characterized in that the acquiring unit is specifically configured to:
    acquire the access data of the N accounts accessing the target website within the plurality of preset times.
  15. The apparatus according to any one of claims 10 to 12, characterized in that, for use after the determining unit determines the account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account, the apparatus further comprises:
    an adding unit, configured to add the abnormal account to a blacklist, and to reject an access request initiated by the abnormal account the next time such a request is received.
  16. The apparatus according to any one of claims 10 to 12, characterized in that, for use after the determining unit determines the account corresponding to abnormally distributed access data in the interval [0,1] as an abnormal account, the apparatus further comprises:
    the adding unit, configured to identify the Internet Protocol (IP) address or domain name of the abnormal account, set other accounts whose IP address is in the same network segment as the IP address of the abnormal account, or whose domain name is the same as the domain name of the abnormal account, as abnormal accounts, and add them to a blacklist;
    or, configured to set other accounts whose IP address is in the same network segment as the IP address of the abnormal account, or whose domain name is the same as the domain name of the abnormal account, as suspected abnormal accounts, compile statistics on the suspected abnormal accounts, and generate a table for output.
  17. The apparatus according to any one of claims 10 to 12, characterized in that the preset time comprises a preset time corresponding to a working period and/or a preset time corresponding to a non-working period, and the preset time is any one of one second, one minute, one hour, one day, or one week.
  18. The apparatus according to any one of claims 10 to 12, characterized in that the acquiring unit is specifically configured to:
    acquire, at a preset period, the access data of the N accounts accessing the target website within the preset time.
  19. A computer non-volatile readable storage medium, characterized in that a computer program is stored on the computer non-volatile readable storage medium, and the program, when executed by a processor, implements the abnormal account detection method according to any one of claims 1 to 9.
  20. A computer device, characterized by comprising: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to perform the abnormal account detection method according to any one of claims 1 to 9.
PCT/CN2019/117581 2019-07-23 2019-11-12 Method, device, and computer storage medium for detecting abnormal account WO2021012509A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910669346.3A CN110460587B (en) 2019-07-23 2019-07-23 Abnormal account detection method and device and computer storage medium
CN201910669346.3 2019-07-23

Publications (1)

Publication Number Publication Date
WO2021012509A1 true WO2021012509A1 (en) 2021-01-28

Family

ID=68483151

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117581 WO2021012509A1 (en) 2019-07-23 2019-11-12 Method, device, and computer storage medium for detecting abnormal account

Country Status (2)

Country Link
CN (1) CN110460587B (en)
WO (1) WO2021012509A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162993A (en) * 2020-11-10 2021-01-01 平安普惠企业管理有限公司 Data updating method and device of blacklist and computer equipment
CN115189901B (en) * 2021-04-07 2024-02-06 北京达佳互联信息技术有限公司 Method and device for identifying abnormal request, server and storage medium
CN114971768B (en) * 2022-04-14 2024-03-05 中国电信股份有限公司 User account identification method and device, computer storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8719257B2 (en) * 2011-02-16 2014-05-06 Symantec Corporation Methods and systems for automatically generating semantic/concept searches
CN107563194A (en) * 2017-09-04 2018-01-09 杭州安恒信息技术有限公司 Latency steals user data behavioral value method and device
CN108809745A (en) * 2017-05-02 2018-11-13 中国移动通信集团重庆有限公司 A kind of user's anomaly detection method, apparatus and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104917643B (en) * 2014-03-11 2019-02-01 腾讯科技(深圳)有限公司 Abnormal account detection method and device
CN105471819B (en) * 2014-08-19 2019-08-30 腾讯科技(深圳)有限公司 Account method for detecting abnormality and device
CN110019074B (en) * 2017-12-30 2021-03-23 中国移动通信集团河北有限公司 Access path analysis method, device, equipment and medium
CN109743309B (en) * 2018-12-28 2021-09-10 微梦创科网络科技(中国)有限公司 Illegal request identification method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG YU, SHANG WEN-LI, ZHAO JIAN-MING, GAO HONG-WEI, ZHENG PENG: "Anomaly Detection Method for Powerlink Protocol Communication", COMPUTER ENGINEERING AND DESIGN, vol. 40, no. 1, 16 January 2019 (2019-01-16), pages 65-70, XP055775189, ISSN: 1000-7024, DOI: 10.16208/j.issn1000-7024.2019.01.011 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344621A (en) * 2021-05-31 2021-09-03 北京百度网讯科技有限公司 Abnormal account determination method and device and electronic equipment
CN113344621B (en) * 2021-05-31 2023-08-04 北京百度网讯科技有限公司 Determination method and device for abnormal account and electronic equipment
CN115603955A (en) * 2022-09-26 2023-01-13 北京百度网讯科技有限公司(Cn) Abnormal access object identification method, device, equipment and medium
CN115603955B (en) * 2022-09-26 2023-11-07 北京百度网讯科技有限公司 Abnormal access object identification method, device, equipment and medium
CN116842327A (en) * 2023-05-18 2023-10-03 中国地质大学(北京) Method, device and equipment for processing abnormal data in resource quantity evaluation
CN116842327B (en) * 2023-05-18 2024-05-10 中国地质大学(北京) Method, device and equipment for processing abnormal data in resource quantity evaluation
CN116644372A (en) * 2023-07-24 2023-08-25 北京芯盾时代科技有限公司 Account type determining method and device, electronic equipment and storage medium
CN116644372B (en) * 2023-07-24 2023-11-03 北京芯盾时代科技有限公司 Account type determining method and device, electronic equipment and storage medium
CN117235654A (en) * 2023-11-15 2023-12-15 中译文娱科技(青岛)有限公司 Artificial intelligence data intelligent processing method and system
CN117235654B (en) * 2023-11-15 2024-03-22 中译文娱科技(青岛)有限公司 Artificial intelligence data intelligent processing method and system

Also Published As

Publication number Publication date
CN110460587A (en) 2019-11-15
CN110460587B (en) 2022-01-25

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19938269; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19938269; Country of ref document: EP; Kind code of ref document: A1)