CN116055119A - Fraudulent user identification method and device based on flow data - Google Patents


Publication number
CN116055119A
Authority
CN
China
Prior art keywords
path
class
flow data
hot spot
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211635597.8A
Other languages
Chinese (zh)
Inventor
陈刚
邓巧华
高霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongtong Uniform Chuangfa Science And Technology Co ltd
Original Assignee
Zhongtong Uniform Chuangfa Science And Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongtong Uniform Chuangfa Science And Technology Co ltd filed Critical Zhongtong Uniform Chuangfa Science And Technology Co ltd
Priority to CN202211635597.8A priority Critical patent/CN116055119A/en
Publication of CN116055119A publication Critical patent/CN116055119A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/142Network analysis or design using statistical or mathematical methods

Abstract

The embodiments of the disclosure provide a method and device for identifying fraudulent users based on traffic data. The method comprises the following steps: determining the visited paths of a website and their path scores according to the traffic data of the website; screening, from the traffic data of the website, the traffic data whose access path has a path score higher than or equal to a preset threshold value; clustering the screened traffic data according to its access paths; determining a hot spot path of each class according to the access paths of the traffic data in that class; calculating, according to the hot spot path of each class, the hot spot path deviation degree of each piece of traffic data in that class; and identifying fraudulent users according to the hot spot path deviation degree of each piece of traffic data in each class. In this way, the identification of fraudulent users can be effectively improved.

Description

Fraudulent user identification method and device based on flow data
Technical Field
The disclosure relates to the technical field of network security, in particular to a fraudulent user identification method and device based on flow data.
Background
With the continuous development of the internet, fraudulent users mixed in among normal users must be identified so that normal users can better enjoy services.
At present, most internet enterprises identify fraudulent users through a rule engine and offline investigation, but this approach suffers from limited coverage and low accuracy. How to improve the identification of fraudulent users is therefore a technical problem to be solved.
Disclosure of Invention
The disclosure provides a fraudulent user identification method and device based on flow data.
In a first aspect, embodiments of the present disclosure provide a method for identifying a fraudulent user based on traffic data, the method comprising:
determining the visited paths of a website and their path scores according to the traffic data of the website;
screening, from the traffic data of the website, the traffic data whose access path has a path score higher than or equal to a preset threshold value;
clustering the screened flow data according to the access path of the screened flow data;
determining a hot spot path of each class according to the access path of each flow data in each class;
according to the hot spot path of each class, calculating the hot spot path deviation degree of each flow data in each class;
and identifying the fraudulent user according to the hot spot path deviation degree of each flow data in each class.
In some implementations of the first aspect, determining a path along which the website is visited and a path score thereof based on traffic data of the website includes:
performing asset tree combing on the traffic data of the website to obtain an asset tree of the website, wherein the asset tree represents the visited paths of the website in a tree structure;
calculating basic scores, algorithm scores, adjustment scores and supplementary information scores of all paths in the asset tree according to the flow data corresponding to the paths in the asset tree;
and calculating the path scores of all paths in the asset tree according to the basic scores, the algorithm scores, the adjustment scores and the supplementary information scores of all paths in the asset tree.
In some implementations of the first aspect, clustering the screened traffic data according to an access path of the screened traffic data includes:
according to the session id of the screened flow data, the flow data of the same session id is aggregated into integrated flow data;
normalizing and vectorizing the path jump of each integrated flow data to obtain the path access characteristics of each integrated flow data;
and clustering the integrated flow data according to the path access characteristics of the integrated flow data.
In some implementations of the first aspect, determining a hotspot path for each class according to the access path for the traffic data in each class includes:
counting, for each access path, the number of pieces of traffic data in the corresponding class that involve the path;
dividing that number by the total number of pieces of traffic data in the corresponding class to obtain the hot spot coefficient of the access path in that class;
and determining, among the access paths of each class, those whose hot spot coefficient is greater than or equal to a preset threshold value as hot spot paths.
In some implementations of the first aspect, calculating, according to the hotspot path of each class, a hotspot path deviation of each traffic data in each class includes:
counting the flow data in each class to obtain the statistical data of the flow data in each class, wherein the statistical data comprises: the time consumption of the hot spot paths, the access proportion of the hot spot paths, the proportion of the number of independent hot spot paths to the total number of the hot spot paths, and the total access number of the hot spot paths;
calculating the time-consuming deviation degree of the hot spot paths of the flow data in each class, the deviation degree of the access proportion of the hot spot paths, the deviation degree of the proportion of the number of independent hot spot paths to the total number of the hot spot paths and the deviation degree of the total access quantity of the hot spot paths according to the statistical data of the flow data in each class;
and calculating the hot spot path deviation degree of each flow data in each class according to each deviation degree of each flow data in each class.
In some implementations of the first aspect, calculating the hotspot path deviation of the traffic data in each class according to the deviation of the traffic data in each class includes:
and for any traffic data in each class, weighting and summing the deviation degrees of the traffic data to obtain the hot spot path deviation degree of the traffic data.
In some implementations of the first aspect, identifying fraudulent users based on hot spot path deviations of traffic data in each class includes:
and determining the user corresponding to the traffic data with the hot spot path deviation degree larger than or equal to the preset threshold value as a fraudulent user.
In a second aspect, embodiments of the present disclosure provide a traffic data-based fraud user identification apparatus, the apparatus comprising:
the determining module is used for determining the visited path and the path score thereof according to the flow data of the website;
the screening module is used for screening the flow data of the paths with the path scores higher than or equal to a preset threshold value from the flow data of the websites;
the clustering module is used for clustering the screened flow data according to the access path of the screened flow data;
the determining module is also used for determining a hot spot path of each class according to the access path of each flow data in each class;
the calculation module is used for calculating the hot spot path deviation degree of each flow data in each class according to the hot spot path of each class;
and the identification module is used for identifying the fraudulent user according to the hot spot path deviation degree of each flow data in each class.
In a third aspect, embodiments of the present disclosure provide an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described above.
In the embodiments of the disclosure, the importance of the different visited paths of a website can be determined from the traffic data of the website, and the traffic data for irrelevant paths is filtered out. The screened traffic data is then clustered so that users acting as a group are judged in a more targeted way, and the fraudulent users within each group are efficiently identified according to the hot spot paths of the class. This avoids judging grouped users indiscriminately and greatly improves the identification of fraudulent users.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. For a better understanding of the present disclosure, and without limiting the disclosure thereto, the same or similar reference numerals denote the same or similar elements, wherein:
FIG. 1 illustrates a flow chart of a method of fraudulent user identification based on traffic data provided by an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of a pseudo-static resource provided by an embodiment of the present disclosure;
FIG. 3 illustrates a block diagram of a traffic data based fraud user identification device provided by an embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the disclosure, are within the scope of the disclosure.
In addition, the term "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
In view of the problems described in the background, embodiments of the present disclosure provide a method and an apparatus for identifying fraudulent users based on traffic data. Specifically, the visited paths of a website and their path scores are determined according to the traffic data of the website; the traffic data whose access path has a path score higher than or equal to a preset threshold value is screened from the traffic data of the website; the screened traffic data is clustered according to its access paths; a hot spot path of each class is determined according to the access paths of the traffic data in that class; the hot spot path deviation degree of each piece of traffic data in each class is calculated according to the hot spot path of that class; and fraudulent users are identified according to the hot spot path deviation degree of each piece of traffic data in each class.
In this way, the importance of the different visited paths of a website can be determined from the traffic data of the website, and the traffic data for irrelevant paths is filtered out. The screened traffic data is then clustered so that users acting as a group are judged in a more targeted way, and the fraudulent users within each group are efficiently identified according to the hot spot paths of the class. This avoids judging grouped users indiscriminately and greatly improves the identification of fraudulent users.
The method and apparatus for identifying fraudulent users based on traffic data provided by the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings by way of specific embodiments.
FIG. 1 shows a flowchart of a method for identifying a fraudulent user based on traffic data according to an embodiment of the present disclosure. As shown in FIG. 1, the fraudulent user identification method 100 may include the following steps:
S110, determining the visited paths of the website and their path scores according to the traffic data of the website.
In some embodiments, the full traffic data of the website may be obtained, and then asset tree combing is performed on the traffic data of the website to obtain an asset tree of the website, where the asset tree represents a path of the website that is accessed in a tree structure.
The basic score, the algorithm score, the adjustment score and the supplementary information score of each path in the asset tree are then calculated according to the traffic data corresponding to that path.
The path score of each path in the asset tree is calculated from its basic score, algorithm score, adjustment score and supplementary information score. For example, for any path in the asset tree, the basic score, the algorithm score, the adjustment score and the supplementary information score are weighted and summed to obtain the path score.
In this way, path scores may be calculated from multiple dimensions, increasing the trustworthiness of the path scores.
S120, screening, from the traffic data of the website, the traffic data whose access path has a path score higher than or equal to a preset threshold value.
In short, according to the scores (i.e., importance degrees) of different paths accessed by the website, the traffic data for important paths (i.e., paths with higher importance degrees) are screened out, and meanwhile, the traffic data for irrelevant paths (i.e., paths with lower importance degrees) are filtered out, so that the accuracy of the identification of the subsequent fraudulent users is effectively improved.
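Steps S110 and S120 can be sketched as follows. This is a minimal illustration, not the patented implementation: the weight values, the record layout (`path` key) and the function names are hypothetical, since the source only states that the four scores are weighted and summed and that paths scoring below a preset threshold are filtered out.

```python
# Hypothetical weights; the source does not give concrete values.
WEIGHTS = {"basic": 0.4, "algorithm": 0.3, "adjustment": 0.2, "supplementary": 0.1}


def path_score(scores, weights=WEIGHTS):
    """Weighted sum of the four per-path scores (S110)."""
    return sum(weights[name] * scores[name] for name in weights)


def filter_traffic(records, scores_by_path, threshold):
    """Keep only records whose path score is >= threshold (S120).

    Records for paths that have no score at all are dropped as well.
    """
    kept = []
    for record in records:
        scores = scores_by_path.get(record["path"])
        if scores is not None and path_score(scores) >= threshold:
            kept.append(record)
    return kept
```

Keeping the weights in one dictionary makes it easy to tune the relative influence of each score without touching the filtering logic.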
S130, clustering the screened flow data according to the access path of the screened flow data.
In some embodiments, according to the session ids of the screened traffic data, the traffic data of the same session id may be aggregated into integrated traffic data, that is, the traffic data of the same session id may be aggregated into an integral corresponding to the session id.
The path jumps of each piece of integrated traffic data are normalized and vectorized to obtain its path access features. For example, path jumps (including the URL and the Referer) are normalized and vectorized using a multi-hot algorithm: each path jump within a request serves as one dimension, and the path jumps of all requests in the traffic data are computed into the corresponding dimensional features, i.e., the path access features.
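The multi-hot vectorization described here can be sketched as below. The sketch assumes that a path jump is the pair (referer, url) of one request, that the paths have already been normalized, and that the second element of the jump named in the text is the HTTP Referer; the data layout and function names are hypothetical.

```python
def session_hops(requests):
    """Return the set of (referer, url) jumps seen in one session."""
    return {(r.get("referer", ""), r["url"]) for r in requests}


def multi_hot(sessions):
    """Encode each session as a 0/1 vector over every jump in the data set.

    sessions: dict mapping session id -> list of request dicts.
    Returns (vocabulary, vectors), where vocabulary[i] names dimension i.
    """
    hops_by_session = {sid: session_hops(reqs) for sid, reqs in sessions.items()}
    vocabulary = sorted(set().union(*hops_by_session.values()))
    index = {hop: i for i, hop in enumerate(vocabulary)}
    vectors = {}
    for sid, hops in hops_by_session.items():
        vector = [0] * len(vocabulary)
        for hop in hops:
            vector[index[hop]] = 1  # the jump occurred at least once
        vectors[sid] = vector
    return vocabulary, vectors
```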
And clustering the integrated flow data according to the path access characteristics of the integrated flow data. For example, the path access features of the integrated flow data are clustered by using a DBSCAN algorithm, so that the integrated flow data are clustered.
S140, according to the access paths of the flow data in each class, determining the hot spot path of each class.
In some embodiments, for each access path, the number of pieces of traffic data in the corresponding class that involve the path may be counted. That number is divided by the total number of pieces of traffic data in the class to obtain the hot spot coefficient of the access path in that class. Among the access paths of each class, those whose hot spot coefficient is greater than or equal to a preset threshold are determined to be hot spot paths.
That is, if a path appears in most of the traffic data within a class, that path can be considered a hotspot path for the class.
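A minimal sketch of the hot spot coefficient, assuming each piece of traffic data in a class is represented by the set of paths it accessed (the representation and the function name are hypothetical):

```python
from collections import Counter


def hotspot_paths(class_sessions, threshold):
    """Return {path: hot spot coefficient} for the hot spot paths of one class.

    class_sessions: list of sets, one set of accessed paths per session.
    The coefficient of a path is the share of sessions that touched it.
    """
    total = len(class_sessions)
    if total == 0:
        return {}
    counts = Counter(path for paths in class_sessions for path in set(paths))
    coefficients = {path: count / total for path, count in counts.items()}
    return {path: k for path, k in coefficients.items() if k >= threshold}
```

With a threshold of 0.5, a path must appear in at least half of the sessions of the class to count as a hot spot path.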
S150, calculating the hot spot path deviation degree of each flow data in each class according to the hot spot path of each class.
In some embodiments, statistics may be performed on the traffic data in each class to obtain statistics of the traffic data in each class, where the statistics may include: the hot spot path time consumption, the hot spot path access proportion, the proportion of the number of independent hot spot paths to the total hot spot path number, and the total access number of the hot spot paths.
From the statistical data of the traffic data in each class, four deviation degrees are calculated for each piece of traffic data: the deviation degree of hot spot path time consumption, the deviation degree of the hot spot path access proportion, the deviation degree of the proportion of the number of independent hot spot paths to the total number of hot spot paths, and the deviation degree of the total number of hot spot path accesses.
The hot spot path deviation degree of each piece of traffic data in each class is then calculated from these deviation degrees. For example, for any piece of traffic data in a class, the deviation degrees are weighted and summed to obtain its hot spot path deviation degree.
In this way, the hot spot path deviation degree of the flow data can be calculated from multiple dimensions, and the reliability of the hot spot path deviation degree is improved.
S160, identifying the fraudulent user according to the hot spot path deviation degree of each flow data in each class.
Specifically, a user corresponding to traffic data whose hot spot path deviation degree is greater than or equal to a preset threshold may be determined to be a fraudulent user.
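Steps S150 and S160 can be sketched as follows. The sketch uses the definition given later in the detailed embodiment, (session statistic - group mean) / group standard deviation, with the time-consumption term multiplied by -1. The equal weights, the field names and the choice of the population standard deviation are assumptions for illustration only.

```python
from statistics import mean, pstdev

STATS = ("time", "access_ratio", "distinct_ratio", "total_accesses")
# Hypothetical equal weights; the source only says the terms are weighted.
WEIGHTS = {name: 0.25 for name in STATS}


def deviations(session, group):
    """Per-statistic deviation of one session against its class."""
    result = {}
    for name in STATS:
        values = [member[name] for member in group]
        mu, sigma = mean(values), pstdev(values)
        z = 0.0 if sigma == 0 else (session[name] - mu) / sigma
        # Shorter time on hot spot paths is treated as more abnormal.
        result[name] = -z if name == "time" else z
    return result


def hotspot_deviation(session, group, weights=WEIGHTS):
    """Weighted sum of the four deviation degrees (S150)."""
    d = deviations(session, group)
    return sum(weights[name] * d[name] for name in STATS)


def flag_fraudulent(group, threshold):
    """Ids of sessions whose deviation reaches the threshold (S160)."""
    return [s["id"] for s in group if hotspot_deviation(s, group) >= threshold]
```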
In the embodiments of the disclosure, the importance of the different visited paths of a website can be determined from the traffic data of the website, and the traffic data for irrelevant paths is filtered out. The screened traffic data is then clustered so that users acting as a group are judged in a more targeted way, and the fraudulent users within each group are efficiently identified according to the hot spot paths of the class. This avoids judging grouped users indiscriminately and greatly improves the identification of fraudulent users.
The following describes in detail a method for identifying a fraudulent user according to an embodiment of the present disclosure with reference to a specific embodiment, which is specifically as follows:
Asset tree combing of the traffic data mainly handles static resources, pseudo-static resources, dynamic resources and common resources. It comprises merging of static resources, parameter normalization of pseudo-static resources, and designation of dynamic resources.
A pseudo-static resource, shown in FIG. 2, is a path that contains parameters. FIG. 2 shows the path produced by searching Baidu for the keyword 1, in which 1 is merely a parameter; searches for different content differ only in the parameter that follows the keyword field. The parameters in such paths therefore need to be normalized when the traffic data is processed, so that analysis of user behavior through paths is more accurate.
The pseudo-static resource normalization process is as follows:
1. The path is converted into a map, with each part represented as a key and a value; for example, https is expressed as scheme: https. Because a part can take several forms in the data, keys are built accordingly: parts without a name of their own are keyed by sequence number in order, and the remaining parts are keyed by their actual parameter names.
2. After the key-value pairs are constructed, they are merged by key. If the number of different values corresponding to the same key across the whole data set exceeds a certain number, the value is judged to be a parameter value and the related content needs to be normalized.
3. A corresponding normalization rule is generated for each path containing parameters and is applied to subsequent paths.
Static resources are mainly paths to specific image and file types, which are named with specific file suffixes. When static resources are processed, the relevant paths are normalized directly according to whether the path suffix is ico, gif, bmp, jpg, css or the like, so that the different kinds of static resources are formally grouped into assets of one type.
Dynamic resources are mainly paths with specific functions, including paths for sensitive operations such as login, logout and payment, whose content indicates a specific behavior. Such paths are treated as dynamic resources during processing and are assets with a relatively large weight.
Paths other than the above three kinds are common resources. After the path resources are combed, a tree structure of all paths is formed, capturing all behavior patterns in the traffic data and providing clearer data for the subsequent algorithms for user behavior analysis and fraud-group account detection.
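The three normalization steps can be sketched as below. The key naming (`path_1`, `query_wd`), the `{param}` placeholder and the `max_distinct` cutoff are hypothetical choices; the source only says that a key whose distinct values exceed a certain number is treated as a parameter.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def to_pairs(url):
    """Step 1: represent a URL as (key, value) pairs."""
    parts = urlsplit(url)
    pairs = [("scheme", parts.scheme), ("host", parts.netloc)]
    segments = [segment for segment in parts.path.split("/") if segment]
    # Unnamed path segments are keyed by position.
    pairs += [(f"path_{i}", seg) for i, seg in enumerate(segments, start=1)]
    # Query parameters are keyed by their actual names.
    query = parse_qsl(parts.query, keep_blank_values=True)
    pairs += [(f"query_{name}", value) for name, value in query]
    return pairs


def learn_parameter_keys(urls, max_distinct):
    """Step 2: keys with more than max_distinct distinct values are parameters."""
    seen = defaultdict(set)
    for url in urls:
        for key, value in to_pairs(url):
            seen[key].add(value)
    return {key for key, values in seen.items() if len(values) > max_distinct}


def normalize(url, parameter_keys, placeholder="{param}"):
    """Step 3: replace parameter values so equivalent paths compare equal."""
    return tuple(
        (key, placeholder if key in parameter_keys else value)
        for key, value in to_pairs(url)
    )
```

Once the parameter keys have been learned from a sample of traffic, `normalize` can be applied to later requests without re-reading the data set.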
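A minimal sketch of suffix-based static resource merging. Collapsing every path with the same suffix into one `<static:suffix>` asset is an assumption about how the merge is done; the suffix set contains only the examples named in the text.

```python
import posixpath
from urllib.parse import urlsplit

# Suffixes named in the text; a real deployment would likely list more.
STATIC_SUFFIXES = {"ico", "gif", "bmp", "jpg", "css"}


def normalize_static(url):
    """Map a static resource URL to one asset per suffix; else return its path."""
    path = urlsplit(url).path
    suffix = posixpath.splitext(path)[1].lstrip(".").lower()
    if suffix in STATIC_SUFFIXES:
        return f"<static:{suffix}>"
    return path
```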
Each path from the root node to a leaf node in the asset tree needs to be scored, i.e., the combed paths are weighted and scored. Scoring rules are set on the fields of the traffic data that can reflect whether a request is normal, which yields the weight of each path within the whole data set. In the scoring process, the basic score and the algorithm score are calculated first and added into a total score, and then the weight adjustment stage is performed (weight adjustment only involves the basic score, the algorithm score and the total score). Weights are first adjusted according to asset classification (for example, static resources are greatly down-weighted); weights are then reduced according to the Ajax ratio of the path (paths that are not Ajax are not down-weighted); weights are then reduced by the corresponding proportion in two cases, namely a high access count with few response types, and a high access count with low jump logic importance; the final total score is thus obtained.
Basic score: reflects the influence of the basic state of a single request on the path asset. The basic score can be computed from field values in the data, and can also be configured to apply a secondary regular-expression extraction to a field. After the different values are scored according to the configuration, the scores are weighted a second time according to the TF-IDF algorithm, so that rarer values receive higher weights. The basic score comprises:
behavior score, i.e., the method used by the request (GET, POST and so on), with a different score preset for each method, for example 10 for GET and 15 for POST;
status score, i.e., a score by the leading digit of the response status code: status codes beginning with 2 indicate a normal connection and score lower, while status codes beginning with other digits indicate a problem with the connection and score higher;
cookie score, i.e., whether the cookie field in the data contains content;
session score, i.e., whether relevant session information is detected by a regular expression;
error status score, i.e., whether a specific status such as 404 is present.
Because each score is set statically and deviates somewhat on real data, it needs to be adjusted dynamically according to the specific data. The adjustment uses the TF-IDF algorithm: the static score of each rule is weighted according to how often the content of the scoring item occurs within each path and whether all paths contain that content, i.e., how rare it is. In effect, if a scoring item is present in most requests, its weight is reduced, lessening its influence on the final score.
Each request in the traffic data receives the five weighted scores described above, but a given asset appears many times in the traffic data, so the scores need to be aggregated per asset. For the behavior score, the score of a path asset is the maximum over all corresponding requests; for the status score, the average over all corresponding requests is used; for the cookie score, the maximum is used; for both the session score and the error status score, the average is used.
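The rarity weighting can be illustrated with a smoothed inverse document frequency. The exact formula is an assumption, since the source names TF-IDF without giving one; the sketch only demonstrates the stated effect that a scoring item present in most requests loses weight.

```python
import math


def idf_weight(total_requests, requests_with_item):
    """Smoothed inverse document frequency: rarer items weigh more."""
    return math.log((1 + total_requests) / (1 + requests_with_item)) + 1.0


def reweight(static_score, total_requests, requests_with_item):
    """Scale a statically configured rule score by how rare its trigger is."""
    return static_score * idf_weight(total_requests, requests_with_item)
```

For example, with 1000 requests, a rule that fires on 990 of them keeps a weight close to 1, while a rule that fires on only 10 is weighted several times higher.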
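The per-asset aggregation rules stated above (maximum for behavior and cookie scores, average for status, session and error status scores) can be sketched as follows; the field names are hypothetical.

```python
from statistics import mean

# Aggregation rule per score, as described in the text.
AGGREGATORS = {
    "behavior": max,
    "status": mean,
    "cookie": max,
    "session": mean,
    "error_status": mean,
}


def aggregate_asset_scores(request_scores):
    """Combine the five scores of every request to one path into asset scores.

    request_scores: non-empty list of dicts, one per request to the path.
    """
    return {
        name: aggregate([scores[name] for scores in request_scores])
        for name, aggregate in AGGREGATORS.items()
    }
```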
Algorithm score: reflects the influence of path pointing in the traffic data on the path asset. Source and access score: the source count or access count is log-transformed and multiplied by a fixed coefficient, so that the sheer volume of data does not greatly affect the score. Jump logic score: the value is calculated by the PageRank algorithm, which mainly measures how paths link to one another and reflects the importance of a path in the data; if many paths point to one path, that path is likely to be more important than the others.
Adjustment score: reflects the influence of the path's responses on the path asset. Response score: the response content is hashed with MD5, the number of distinct hash values is counted, and the count is log-transformed and multiplied by a fixed coefficient. Ajax score: each log entry is checked for whether it is an Ajax request (specifically, by X-Requested-With: XMLHttpRequest), and the proportion of Ajax calls for the path is then computed.
Supplementary information score: reflects the influence of other relevant information on the traffic data. API identification score: whether the path is likely to be an API interface is identified from the format of the Content-Type field (for example, whether it is html/xml), and a corresponding score is given.
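Both parts of the algorithm score can be sketched in plain Python. The coefficient, damping factor and iteration count are hypothetical, and `log1p` (log of 1 + count) is used instead of a bare log so that a count of zero is defined; the PageRank shown is the textbook power iteration, not necessarily the variant used in the patent.

```python
import math


def volume_score(count, coefficient=5.0):
    """Log-compress a source or access count, then scale it."""
    return coefficient * math.log1p(count)


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source_path, target_path) jumps."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A path with no outgoing jumps spreads its rank evenly.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

A path that many other paths jump to, such as a login page, ends up with a higher rank than a path that is only an entry point.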
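The response score and the Ajax ratio can be sketched as below. The coefficient and data layout are hypothetical, and the header check assumes the garbled header in the text refers to the conventional `X-Requested-With: XMLHttpRequest` marker sent by many Ajax libraries.

```python
import hashlib
import math


def response_score(bodies, coefficient=5.0):
    """Score a path by the log of its number of distinct response bodies."""
    distinct = {hashlib.md5(body).hexdigest() for body in bodies}
    if not distinct:
        return 0.0
    return coefficient * math.log(len(distinct))


def is_ajax(headers):
    """True if the request headers carry the XMLHttpRequest marker."""
    for name, value in headers.items():
        if name.lower() == "x-requested-with" and value == "XMLHttpRequest":
            return True
    return False


def ajax_ratio(header_list):
    """Share of a path's requests that are Ajax calls."""
    if not header_list:
        return 0.0
    return sum(1 for headers in header_list if is_ajax(headers)) / len(header_list)
```

A path that always returns the same body scores 0, which matches the idea that few response types indicate a less informative asset.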
First, accesses (i.e., traffic data) whose corresponding asset score is below a certain threshold are filtered out, eliminating accesses to irrelevant paths and reducing the influence of noise paths on the subsequent algorithm.
Since the data generally does not contain a user ID, the session ID is extracted from the cookie, and the accesses sharing the same session ID are aggregated into one session as a whole, which represents the access behavior of that session. To help focus on effective sessions during aggregation, the following settings are available:
because some sessions are far longer than others, an overlong session can be cut into several sessions: if the interval between two adjacent operations is longer than 30 minutes, the session is cut at that point;
the sessions may then be further screened, with criteria including:
(1) the minimum number of requests to key paths that a session needs to cover;
(2) the key paths covered by a session must include specific keywords, i.e., certain sensitive paths.
After the above operations, a number of sessions are generated. The ID of each session is formed by splicing the extracted session ID with the cut-off time; each session contains the relevant key asset accesses and has a certain length. Subsequent clustering and other operations take the session as the unit. A multi-hot algorithm is used to normalize and vectorize the path jumps (including the URL and the Referer): each path jump within a request serves as one dimension, and the path jumps of all requests in the session are computed into the corresponding dimensional features.
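The session-cutting rule can be sketched as follows. Using the timestamp of the last request in each piece as the "cut-off time" in the spliced ID is an assumption, as is the ID format; only the 30-minute gap comes from the text.

```python
from datetime import timedelta

MAX_GAP = timedelta(minutes=30)


def split_session(session_id, timestamps, max_gap=MAX_GAP):
    """Cut one session wherever adjacent requests are more than max_gap apart.

    Returns {sub_session_id: [timestamps]}, where each id is the original
    session id joined with the time of the last request in that piece.
    """
    pieces = {}
    current = []
    previous = None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous > max_gap:
            pieces[f"{session_id}_{previous.isoformat()}"] = current
            current = []
        current.append(ts)
        previous = ts
    if current:
        pieces[f"{session_id}_{previous.isoformat()}"] = current
    return pieces
```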
After preprocessing, the vectorized sessions are clustered with the DBSCAN algorithm, using a similarity distance to measure the deviation between sessions. The algorithm connects sessions whose distance from each other is less than the threshold, and when one session is connected to multiple sessions at the same time, a group, i.e., a class, begins to form. Note that, unlike most clustering algorithms, this algorithm also produces sessions that belong to no group; these are assigned to group -1. The clustering algorithm needs the following information to be specified:
(1) A distance threshold between sessions;
(2) The minimum number of connections between sessions required for clustering.
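For illustration, the clustering step with its two parameters can be sketched in plain Python as below. The text does not name the similarity distance, so the Jaccard distance on multi-hot vectors is an assumption here, as are the names `jaccard_distance`, `dbscan`, `eps` and `min_samples`; a session counts as its own neighbour, which is a convention of this sketch. A library implementation of DBSCAN would normally be preferred in practice.

```python
def jaccard_distance(a, b):
    """1 - (shared hops / hops present in either vector); assumed metric."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - shared / either


def dbscan(vectors, eps, min_samples, distance=jaccard_distance):
    """Return one group label per vector; -1 marks sessions in no group.

    eps:         distance threshold between sessions
    min_samples: minimum number of connected sessions (itself included)
                 for a session to start or extend a group
    """
    n = len(vectors)
    neighbors = [
        [j for j in range(n) if distance(vectors[i], vectors[j]) < eps]
        for i in range(n)
    ]
    labels = [None] * n
    group = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally ungrouped
            continue
        group += 1
        labels[i] = group
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = group  # border session reached from the group
            if labels[j] is not None:
                continue
            labels[j] = group
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])
    return labels
```

With `eps=0.5` and `min_samples=2`, three sessions that share most of their hops fall into group 0, while a session with an unrelated hop stays in group -1.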
After the session groups are generated, if a path appears in more than a certain threshold proportion of the sessions in a group, the path is considered a hot spot path of that group. In this algorithm, the same path may well appear in the hot spot path lists of several groups.
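A minimal sketch of the hot spot path determination is given below, under stated assumptions: each session is represented by the set of paths it accessed, the comparison with the threshold is inclusive (as in the device embodiment described later), and sessions in group -1 are skipped. The function name `hotspot_paths` and the skipping of group -1 are choices of this sketch, not statements of the text.

```python
from collections import Counter, defaultdict


def hotspot_paths(session_paths, labels, threshold):
    """Return {group label: set of hot spot paths}.

    session_paths: list of sets of accessed paths, one set per session
    labels:        group label of each session (-1 = no group)
    threshold:     minimum share of the group's sessions containing the path
    """
    groups = defaultdict(list)
    for paths, label in zip(session_paths, labels):
        if label != -1:  # assumption: ungrouped sessions define no hot spots
            groups[label].append(paths)

    result = {}
    for label, members in groups.items():
        # Count in how many sessions of the group each path appears.
        counts = Counter(path for paths in members for path in set(paths))
        result[label] = {
            path for path, count in counts.items()
            if count / len(members) >= threshold
        }
    return result
```

Because each group is evaluated separately, a path such as a login page can be a hot spot path of several groups at once.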
After clustering, each session is scored according to the generated hot spot paths and the related statistical data. The statistical data comprise the hot spot path time consumption of each session, the hot spot path access proportion of each session, the proportion of the number of independent hot spot paths of each session to the total number of hot spot paths, and the total number of hot spot path accesses.
In actual scoring, each session is scored against the statistics of its group. The items include:
(1) Deviation degree of the hot spot path time consumption;
(2) Deviation degree of the hot spot path access proportion;
(3) Deviation degree of the proportion of the number of independent hot spot paths to the total number of hot spot paths;
(4) Deviation degree of the total number of hot spot path accesses.
Here, the deviation degree is defined as (the statistic of the session - the mean of the statistic in the group) / the standard deviation of the statistic in the group; the time consumption deviation degree is additionally multiplied by -1, to express that the shorter the time, the more abnormal the session. The final total score is the weighted sum of the deviation degrees of the items, with the following two adjustments additionally applied:
(1) when the sum of the deviation degree of the proportion of the number of independent hot spot paths accessed by the session to the total number of hot spot paths and the time consumption deviation degree on the hot spot paths is greater than 0, extra points are added according to the two deviation degrees;
(2) when the total number of accesses on the hot spot paths exceeds a certain higher threshold, the weight is additionally reduced; this situation is commonly seen with users who repeatedly refresh the same business or perform other unskilled operations.
A corresponding threshold is set for the hot spot path deviation degree score, which reflects user behavior, and the user corresponding to a session whose hot spot path deviation degree is greater than or equal to the threshold is determined to be a fraudulent user.
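As a hedged sketch of the scoring described above, the deviation degree of each statistic, the weighted total and the threshold comparison might be written as follows. The statistic keys in `FEATURES`, the use of the population standard deviation, the zero deviation returned when a group has no spread, and the function names are all assumptions of this sketch. The two additional adjustments are left out because the text does not quantify them.

```python
from statistics import mean, pstdev

# Assumed keys for the four per-session statistics named in the text.
FEATURES = ("time", "access_ratio", "unique_ratio", "total_access")


def deviations(session_stats, group_stats):
    """Deviation degree of each statistic: (session value - group mean)
    / group standard deviation, with the time deviation multiplied by -1."""
    result = {}
    for name in FEATURES:
        values = [stats[name] for stats in group_stats]
        spread = pstdev(values)  # assumption: population standard deviation
        z = 0.0 if spread == 0 else (session_stats[name] - mean(values)) / spread
        result[name] = -z if name == "time" else z
    return result


def total_score(deviation, weights):
    """Weighted sum of the four deviation degrees."""
    return sum(weights[name] * deviation[name] for name in FEATURES)


def flag_fraud(scores, threshold):
    """Session IDs whose score is greater than or equal to the threshold."""
    return [sid for sid, score in scores.items() if score >= threshold]
```

A session that completes the hot spot paths much faster than the rest of its group obtains a positive time deviation, which raises its total score toward the threshold.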
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
FIG. 3 illustrates a block diagram of a fraudulent user identification device based on traffic data provided by an embodiment of the present disclosure. As illustrated in FIG. 3, the fraudulent user identification device 300 may include:
The determining module 310 is configured to determine the visited paths of the website and their path scores according to the traffic data of the website.
The screening module 320 is configured to screen, from the traffic data of the website, the traffic data of paths whose path score is higher than or equal to a preset threshold.
The clustering module 330 is configured to cluster the screened traffic data according to the access paths of the screened traffic data.
The determining module 310 is further configured to determine the hot spot paths of each class according to the access paths of the traffic data in each class.
The calculating module 340 is configured to calculate the hot spot path deviation degree of each piece of traffic data in each class according to the hot spot paths of each class.
The identification module 350 is configured to identify fraudulent users according to the hot spot path deviation degree of each piece of traffic data in each class.
In some embodiments, the determining module 310 is specifically configured to:
sorting out an asset tree from the traffic data of the website to obtain the asset tree of the website, wherein the asset tree represents the visited paths of the website in a tree structure;
calculating basic scores, algorithm scores, adjustment scores and supplementary information scores of all paths in the asset tree according to the flow data corresponding to the paths in the asset tree;
and calculating the path scores of all paths in the asset tree according to the basic scores, the algorithm scores, the adjustment scores and the supplementary information scores of all paths in the asset tree.
In some embodiments, the clustering module 330 is specifically configured to:
according to the session id of the screened flow data, the flow data of the same session id is aggregated into integrated flow data;
normalizing and vectorizing the path jump of each integrated flow data to obtain the path access characteristics of each integrated flow data;
and clustering the integrated flow data according to the path access characteristics of the integrated flow data.
In some embodiments, the determining module 310 is specifically configured to:
counting the number of the flow data involved in each access path in the corresponding class;
dividing the number of the flow data related to each access path in the corresponding class by the total number of the flow data in the corresponding class to obtain a hot spot coefficient of each access path in the corresponding class;
and determining the access path with the hot spot coefficient larger than or equal to a preset threshold value in the access path corresponding to each class as a hot spot path.
In some embodiments, the computing module 340 is specifically configured to:
counting the flow data in each class to obtain the statistical data of the flow data in each class, wherein the statistical data comprises: the time consumption of the hot spot paths, the access proportion of the hot spot paths, the proportion of the number of independent hot spot paths to the total number of the hot spot paths, and the total access number of the hot spot paths;
calculating the time-consuming deviation degree of the hot spot paths of the flow data in each class, the deviation degree of the access proportion of the hot spot paths, the deviation degree of the proportion of the number of independent hot spot paths to the total number of the hot spot paths and the deviation degree of the total access quantity of the hot spot paths according to the statistical data of the flow data in each class;
and calculating the hot spot path deviation degree of each flow data in each class according to each deviation degree of each flow data in each class.
In some embodiments, the computing module 340 is specifically configured to:
and for any traffic data in each class, weighting and summing the deviation degrees of the traffic data to obtain the hot spot path deviation degree of the traffic data.
In some embodiments, the identification module 350 is specifically configured to:
and determining the user corresponding to the traffic data with the hot spot path deviation degree larger than or equal to the preset threshold value as a fraudulent user.
It will be appreciated that each module/unit in the fraudulent user identification apparatus 300 shown in fig. 3 has a function of implementing each step in the fraudulent user identification method 100 shown in fig. 1, and can achieve its corresponding technical effect, and for brevity, will not be described in detail herein.
Fig. 4 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure. Electronic device 400 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic device 400 may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the electronic device 400 may include a computing unit 401 that may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In the RAM403, various programs and data required for the operation of the electronic device 400 may also be stored. The computing unit 401, ROM402, and RAM403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Various components in electronic device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, etc.; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408, such as a magnetic disk, optical disk, etc.; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the various methods and processes described above, such as method 100. For example, in some embodiments, the method 100 may be implemented as a computer program product, including a computer program, tangibly embodied on a computer-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM402 and/or the communication unit 409. One or more of the steps of the method 100 described above may be performed when a computer program is loaded into RAM403 and executed by the computing unit 401. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the method 100 by any other suitable means (e.g., by means of firmware).
The various embodiments described above herein may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a computer-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a computer-readable storage medium would include one or more wire-based electrical connections, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that the present disclosure further provides a non-transitory computer readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to perform the method 100 and achieve corresponding technical effects achieved by performing the method according to the embodiments of the present disclosure, which are not described herein for brevity.
In addition, the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the method 100.
To provide for interaction with a user, the embodiments described above may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The above-described embodiments may be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed technical solutions are achieved, which is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A method for identifying fraudulent users based on traffic data, the method comprising:
determining the visited path and the path score of the website according to the traffic data of the website;
screening, from the traffic data of the website, the traffic data of paths whose path score is higher than or equal to a preset threshold value;
clustering the screened flow data according to the access path of the screened flow data;
determining a hot spot path of each class according to the access path of each flow data in each class;
according to the hot spot path of each class, calculating the hot spot path deviation degree of each flow data in each class;
and identifying the fraudulent user according to the hot spot path deviation degree of each flow data in each class.
2. The method of claim 1, wherein determining the path along which the web site was visited and its path score based on the traffic data of the web site comprises:
sorting out an asset tree from the traffic data of the website to obtain the asset tree of the website, wherein the asset tree represents the visited paths of the website in a tree structure;
calculating basic scores, algorithm scores, adjustment scores and supplementary information scores of all paths in the asset tree according to the flow data corresponding to the paths in the asset tree;
and calculating the path scores of all paths in the asset tree according to the basic scores, the algorithm scores, the adjustment scores and the supplementary information scores of all paths in the asset tree.
3. The method of claim 1, wherein the clustering the screened traffic data according to the access path of the screened traffic data comprises:
according to the session id of the screened flow data, the flow data of the same session id is aggregated into integrated flow data;
normalizing and vectorizing the path jump of each integrated flow data to obtain the path access characteristics of each integrated flow data;
and clustering the integrated flow data according to the path access characteristics of the integrated flow data.
4. The method of claim 1, wherein determining the hotspot path for each class based on the access paths for the traffic data in each class comprises:
counting the number of the flow data involved in each access path in the corresponding class;
dividing the number of the flow data related to each access path in the corresponding class by the total number of the flow data in the corresponding class to obtain a hot spot coefficient of each access path in the corresponding class;
and determining the access path with the hot spot coefficient larger than or equal to a preset threshold value in the access path corresponding to each class as a hot spot path.
5. The method of claim 1, wherein calculating the hotspot path deviation of the traffic data in each class according to the hotspot path of each class comprises:
counting the flow data in each class to obtain the statistical data of the flow data in each class, wherein the statistical data comprises: the time consumption of the hot spot paths, the access proportion of the hot spot paths, the proportion of the number of independent hot spot paths to the total number of the hot spot paths, and the total access number of the hot spot paths;
calculating the time-consuming deviation degree of the hot spot paths of the flow data in each class, the deviation degree of the access proportion of the hot spot paths, the deviation degree of the proportion of the number of independent hot spot paths to the total number of the hot spot paths and the deviation degree of the total access quantity of the hot spot paths according to the statistical data of the flow data in each class;
and calculating the hot spot path deviation degree of each flow data in each class according to each deviation degree of each flow data in each class.
6. The method of claim 5, wherein calculating the hotspot path deviation of the traffic data in each class based on the deviations of the traffic data in each class, comprises:
and for any traffic data in each class, weighting and summing the deviation degrees of the traffic data to obtain the hot spot path deviation degree of the traffic data.
7. The method of any of claims 1-6, wherein identifying fraudulent users based on the hot spot path deviations of the traffic data in each class comprises:
and determining the user corresponding to the traffic data with the hot spot path deviation degree larger than or equal to the preset threshold value as a fraudulent user.
8. A fraudulent user identification device based on traffic data, the device comprising:
the determining module is used for determining the visited path and the path score thereof according to the flow data of the website;
the screening module is used for screening the flow data of the paths with the path scores higher than or equal to a preset threshold value from the flow data of the websites;
the clustering module is used for clustering the screened flow data according to the access path of the screened flow data;
the determining module is further configured to determine a hotspot path of each class according to the access path of each flow data in each class;
the calculation module is used for calculating the hot spot path deviation degree of each flow data in each class according to the hot spot path of each class;
and the identification module is used for identifying the fraudulent user according to the hot spot path deviation degree of each flow data in each class.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202211635597.8A 2022-12-19 2022-12-19 Fraudulent user identification method and device based on flow data Pending CN116055119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211635597.8A CN116055119A (en) 2022-12-19 2022-12-19 Fraudulent user identification method and device based on flow data

Publications (1)

Publication Number Publication Date
CN116055119A true CN116055119A (en) 2023-05-02

Family

ID=86115497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211635597.8A Pending CN116055119A (en) 2022-12-19 2022-12-19 Fraudulent user identification method and device based on flow data

Country Status (1)

Country Link
CN (1) CN116055119A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination