CN112667709B - Campus card leasing behavior detection method and system based on Spark - Google Patents

Info

Publication number
CN112667709B
Authority
CN
China
Prior art keywords
data
behavior data
calibration
detected
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011553092.8A
Other languages
Chinese (zh)
Other versions
CN112667709A (en)
Inventor
于磊磊
李永在
乔禹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202011553092.8A priority Critical patent/CN112667709B/en
Publication of CN112667709A publication Critical patent/CN112667709A/en
Application granted granted Critical
Publication of CN112667709B publication Critical patent/CN112667709B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a Spark-based method and system for detecting campus card leasing behavior. The method acquires users' campus card usage data as the data to be detected; acquires manually screened campus card usage data of users marked as leasing as the calibration data; converts the data to be detected into a behavior data set to be detected and the calibration data into a calibration behavior data set; quantizes the category features in both data sets and then standardizes all features; uses Spark to calculate the weight of each feature in the calibration behavior data set in parallel; calculates, in parallel, the weighted distances between the behavior data to be detected and all data in the calibration behavior data set; and sorts the calibration behavior data by distance from the behavior data to be detected in ascending order, then selects the first K calibration behavior data for Gaussian-weighted voting to obtain the category of the behavior data to be detected.

Description

Campus card leasing behavior detection method and system based on Spark
Technical Field
The application relates to the technical field of abnormal behavior data detection, in particular to a campus card leasing behavior detection method and system based on Spark.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In existing campus card management, card leasing occurs. To find and stop this behavior in time, a school's campus card management department needs a method for detecting campus card leasing. Existing detection, however, relies on manual experience to screen and label cards, which is prone to false detections and missed detections; as a result, campus cards are frequently misused and the rights and interests of legitimate card users are harmed.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides a campus card leasing behavior detection method and system based on Spark;
in a first aspect, the application provides a campus card leasing behavior detection method based on Spark;
a campus card leasing behavior detection method based on Spark comprises the following steps:
acquiring use data of a campus card by a user, and taking the acquired data as data to be detected;
acquiring manually screened use data of users marked as leases on the campus card, and taking the acquired data as calibration data; converting the data to be detected into a behavior data set to be detected, and converting the calibration data into a calibration behavior data set;
respectively carrying out quantitative processing on category characteristics in the behavior data set to be detected and the calibration behavior data set; respectively carrying out standardized processing on all characteristics in the behavior data set to be detected and the calibration behavior data set;
a Spark engine is adopted to calculate the weight of each characteristic in the calibration behavior data set in parallel;
adopting a Spark engine to calculate the distance between the behavior data to be detected and all data in the calibration behavior data set in a parallel weighting manner;
and sequencing the data according to the distance between the behavior data to be detected and the calibration behavior data from small to large, and selecting the first K calibration behavior data to perform Gaussian weight weighted voting to obtain the category of the behavior data to be detected.
In a second aspect, the application provides a campus card leasing behavior detection system based on Spark;
campus card lease behavior detection system based on Spark includes:
a data acquisition module configured to: acquiring use data of a campus card by a user, and taking the acquired data as data to be detected; acquiring manually screened use data of users marked as leases on the campus card, and taking the acquired data as calibration data; converting the data to be detected into a behavior data set to be detected, and converting the calibration data into a calibration behavior data set;
a data pre-processing module configured to: respectively carrying out quantitative processing on category characteristics in the behavior data set to be detected and the calibration behavior data set; respectively carrying out standardized processing on all characteristics in the behavior data set to be detected and the calibration behavior data set;
a weight calculation module configured to: calculating the weight of each characteristic in the calibration behavior data set in parallel by adopting a Spark engine;
a distance calculation module configured to: adopting a Spark engine to calculate the distance between the behavior data to be detected and all data in the calibration behavior data set in a parallel weighting manner;
a voting module configured to: and sequencing the data according to the distance between the behavior data to be detected and the calibration behavior data from small to large, and selecting the first K calibration behavior data to perform Gaussian weight weighted voting to obtain the category of the behavior data to be detected.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program for implementing the method of any of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of this application are:
a novel quick and efficient campus card leasing behavior detection method and system based on Spark are provided, and a batch analysis mining mode of data is adopted to replace the existing individual analysis screening mode which is mainly determined by experience judgment and evidence, so that the efficiency and the accuracy of leasing behavior detection are remarkably improved, non-explicit leasing behaviors can be effectively detected, and the comprehensive management of a campus card system can be effectively assisted.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the method of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the invention may be combined with each other without conflict.
Example one
The embodiment provides a campus card leasing behavior detection method based on Spark;
a campus card leasing behavior detection method based on Spark comprises the following steps:
s101: acquiring use data of a campus card by a user, and taking the acquired data as data to be detected;
acquiring manually screened use data of users marked as leases on the campus card, and taking the acquired data as calibration data; converting the data to be detected into a behavior data set to be detected, and converting the calibration data into a calibration behavior data set;
s102: respectively carrying out quantitative processing on category characteristics in the behavior data set to be detected and the calibration behavior data set; respectively carrying out standardized processing on all characteristics in the behavior data set to be detected and the calibration behavior data set;
s103: calculating the weight of each characteristic in the calibration behavior data set in parallel by adopting a Spark engine;
s104: adopting a Spark engine to calculate the distance between the behavior data to be detected and all data in the calibration behavior data set in a parallel weighting manner;
s105: and sequencing the data according to the distance between the behavior data to be detected and the calibration behavior data from small to large, and selecting the first K calibration behavior data to perform Gaussian weight weighted voting to obtain the category of the behavior data to be detected.
As one or more embodiments, the S101: converting the data to be detected into a behavior data set to be detected; the method comprises the following specific steps:
data to be detected includes: account number, school number, name, gender, college, identity type, transaction amount, transaction merchant, and transaction time;
performing feature extraction on the data to be detected to obtain its features; the features of the data to be detected comprise: gender, identity, whether the user is in a graduating class, total consumption amount, total number of consumption transactions, catering consumption amount, bathing consumption proportion, fitness consumption proportion, whether learning-related records exist, and whether medical-related records exist;
and storing the characteristics of the data to be detected according to the user number to obtain a behavior data set to be detected.
As one or more embodiments, the S101: converting the calibration data into a calibration behavior data set; the method comprises the following specific steps:
calibration data, comprising: an account number, school number, name, gender, college, identity type, transaction amount, transaction merchant, transaction time, and whether a lease activity is present;
extracting features from the calibration data to obtain the calibration data features; the calibration data features comprise: gender, identity, whether the user is in a graduating class, total consumption amount, total number of consumption transactions, catering consumption amount, bathing consumption proportion, fitness consumption proportion, whether learning-related records exist, whether medical-related records exist, and a label indicating whether leasing behavior exists;
and storing the calibration data characteristics according to the numbers to obtain a calibration behavior data set.
Both conversions (the data to be detected into the behavior data set to be detected, and the calibration data into the calibration behavior data set) are performed with a consumption behavior data model. The model is designed on the basis of data statistics and empirical judgment: through data aggregation, it converts a huge volume of raw transaction records, which show no significant behavioral characteristics, into a moderate number of consumption behavior records with significant behavioral characteristics. The transaction records of each campus card are merged week by week into one consumption behavior record, and 11 features in 6 categories are defined, as shown in Table 1.
Table 1. Behavior feature definition table
[Table image not reproduced in this text.]
Further, in the consumption behavior data model: the identity category mainly characterizes whether the person's need for the card is rigid, and includes gender, identity, and whether the user is in a graduating class, where identity is one of undergraduate, master's student, doctoral student, faculty or staff member, alumnus, and temporary personnel; the overall consumption category characterizes whether consumption behavior is continuous and stable, and includes total consumption amount and total number of consumption transactions; the daily living category characterizes the typical behavior of students living at school, and includes catering consumption amount, bathing consumption amount, and bathing consumption proportion; the exercise and fitness category characterizes the use of the card for on-campus exercise and fitness, and includes fitness consumption proportion; the study and work category characterizes whether there are records of personal study or work behavior such as self-service printing and book borrowing, and includes whether learning-related records exist; and the medical care category characterizes whether there are hospital visit records, and includes whether medical-related records exist.
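The weekly aggregation described above can be sketched as follows. This is a minimal illustration, not the patented data model: the tuple layout, the merchant type names ("catering", "bathing", "fitness", "learning", "medical") and the output field names are assumptions introduced here, and the static identity features (gender, identity, graduating class) are omitted because they come from the account record rather than from transactions.

```python
from collections import defaultdict
from datetime import datetime

def aggregate_weekly(transactions):
    """Merge raw transactions into one behavior record per (account, ISO year, ISO week).

    Each transaction is a tuple (account, amount, merchant_type, iso_timestamp).
    The merchant_type vocabulary is hypothetical.
    """
    sums = defaultdict(lambda: defaultdict(float))  # key -> merchant_type -> amount
    counts = defaultdict(int)                       # key -> number of transactions
    for account, amount, merchant_type, timestamp in transactions:
        iso = datetime.fromisoformat(timestamp).isocalendar()
        key = (account, iso[0], iso[1])
        sums[key][merchant_type] += amount
        counts[key] += 1

    records = {}
    for key, by_type in sums.items():
        total = sum(by_type.values())
        records[key] = {
            "total_amount": total,
            "total_count": counts[key],
            "catering_amount": by_type.get("catering", 0.0),
            "bathing_amount": by_type.get("bathing", 0.0),
            # Ratios are taken against the weekly total; 0.0 when nothing was spent.
            "bathing_ratio": by_type.get("bathing", 0.0) / total if total else 0.0,
            "fitness_ratio": by_type.get("fitness", 0.0) / total if total else 0.0,
            "has_learning_record": "learning" in by_type,
            "has_medical_record": "medical" in by_type,
        }
    return records
```

Grouping by ISO week is one possible reading of "every week"; a calendar aligned to the school term would work equally well under the same structure.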
As one or more embodiments, the S102: respectively carrying out quantitative processing on category characteristics in the behavior data set to be detected and the calibration behavior data set; respectively carrying out standardized processing on all characteristics in the behavior data set to be detected and the calibration behavior data set; the method comprises the following specific steps:
quantizing the category features in the behavior data set to be detected and in the calibration behavior data set using one-hot encoding;
and respectively carrying out standardization processing on all characteristics in the behavior data set to be detected and the calibration behavior data set by adopting a Z-score standardization method.
One-hot encoding is a common method for quantizing category features. It uses an N-bit state register to encode N categories, so that each value of a category feature maps to a point in Euclidean space, which makes distances between category values meaningful.
Further, the one-hot encoding of the category features in Table 1 is shown in Table 2. The encoding also expands the feature set, from 11 features to 20.
Table 2. Behavior feature definition table
[Table image not reproduced in this text.]
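The one-hot step can be sketched as below. The category vocabularies, field names, and column order are assumptions made for illustration; the actual layout of Table 2 is not recoverable from this text. With a 6-way identity feature and four 2-way features, this particular split does yield 20 columns from 11 features, which is consistent with the expansion stated above.

```python
# Hypothetical category vocabularies; the real encoding is defined in Table 2.
GENDERS = ["male", "female"]
IDENTITY_TYPES = ["undergraduate", "master", "doctoral", "staff", "alumni", "temporary"]
YES_NO = [False, True]
NUMERIC_FEATURES = [
    "total_amount", "total_count", "catering_amount",
    "bathing_amount", "bathing_ratio", "fitness_ratio",
]

def one_hot(value, categories):
    """Return an N-bit vector with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def encode_record(record):
    """Expand the 5 category features to one-hot bits and append the 6 numeric ones."""
    vector = []
    vector += one_hot(record["gender"], GENDERS)
    vector += one_hot(record["identity"], IDENTITY_TYPES)
    for flag in ("graduating", "has_learning_record", "has_medical_record"):
        vector += one_hot(bool(record[flag]), YES_NO)
    vector += [float(record[name]) for name in NUMERIC_FEATURES]
    return vector
```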
Z-score standardization is a common method for standardizing feature data so that feature values are on the same order of magnitude; every feature is transformed to a new distribution with mean 0 and standard deviation 1:
$$x' = \frac{x - \mu}{\sigma}$$
where x is the original feature value, μ and σ are the mean and standard deviation of the feature data, respectively, and x' is the standardized feature value.
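As a small illustration of the Z-score step, the sketch below standardizes one feature column. It uses the population standard deviation and maps a constant column to zeros; both choices are assumptions, since the text does not say which estimator is used or how a zero standard deviation is handled.

```python
import math

def z_score(values):
    """Standardize one feature column to mean 0 and standard deviation 1."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)  # population std (assumed)
    if sigma == 0:
        return [0.0] * n  # constant feature: no spread to scale by
    return [(v - mu) / sigma for v in values]
```

In practice the mean and standard deviation would presumably be estimated once and then applied to both data sets, so that the two sets share the same scale.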
As one or more embodiments, the S103: calculating the weight of each characteristic in the calibration behavior data set in parallel by adopting a Spark engine; the method comprises the following specific steps:
the Driver of the Spark computing engine stores the calibration behavior data set on the Hadoop Distributed File System (HDFS) and then converts it into a resilient distributed dataset (RDD);
each Worker node is divided into several Executors according to its resources; an improved Relief method runs in parallel on each Executor to calculate the weight of each feature in the calibration behavior data set, yielding the feature weight values on that Executor;
and the Driver averages the feature weights obtained from the Executors and takes the averages as the feature weight values.
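The partition, compute, and average pattern above can be mimicked without a cluster. The sketch below is not Spark code: a thread pool stands in for the Executors, round-robin slicing stands in for RDD partitioning, and `placeholder_weights` is a hypothetical stand-in for the improved Relief step, included only so the example runs.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(rows, n_partitions):
    """Deal rows round-robin into n_partitions lists (stand-in for RDD partitions)."""
    return [rows[i::n_partitions] for i in range(n_partitions)]

def placeholder_weights(partition):
    """Hypothetical per-partition weight function: mean absolute value per feature."""
    n_features = len(partition[0])
    return [sum(abs(row[j]) for row in partition) / len(partition)
            for j in range(n_features)]

def average_weights(per_partition_weights):
    """Driver-side step: average the weight vectors returned by the partitions."""
    count = len(per_partition_weights)
    n_features = len(per_partition_weights[0])
    return [sum(w[j] for w in per_partition_weights) / count
            for j in range(n_features)]

def parallel_feature_weights(rows, n_partitions, weight_fn=placeholder_weights):
    """Compute weights on each partition in parallel, then average them."""
    partitions = split_into_partitions(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        per_partition = list(pool.map(weight_fn, partitions))
    return average_weights(per_partition)
```

The sketch assumes there are at least as many rows as partitions; an empty partition is not handled.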
The Relief method is an existing, prior-art method for calculating feature weights. It comprises the following steps:
randomly extracting a sample R from the training sample set each time, finding the nearest neighbor sample H of R among the samples of the same class, and finding the nearest neighbor sample M among the samples of a different class from R;
when updating the weights, if the distance between R and H on a certain feature is smaller than the distance between R and M, the weight of the feature is increased; otherwise the weight of the feature is decreased. The above process is repeated m times, finally giving the weight of each feature.
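For reference, the classic Relief procedure described above can be sketched as follows. This is the textbook two-class form with brute-force neighbor search and a range-normalized per-feature difference; those details, and the function and parameter names, are assumptions of this sketch rather than the formulas of the present method.

```python
import random

def feature_diff(a, x, y, ranges):
    """Range-normalized difference of samples x and y on feature a."""
    return abs(x[a] - y[a]) / ranges[a] if ranges[a] else 0.0

def relief(samples, labels, m, seed=0):
    """Classic Relief: reward features that separate R from its nearest miss
    more than from its nearest hit. Needs >= 2 samples per class."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    ranges = [max(s[a] for s in samples) - min(s[a] for s in samples)
              for a in range(n_features)]

    def dist(x, y):
        return sum(feature_diff(a, x, y, ranges) for a in range(n_features))

    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(len(samples))
        r = samples[i]
        hits = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        misses = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        hit = min(hits, key=lambda s: dist(r, s))     # nearest same-class sample H
        miss = min(misses, key=lambda s: dist(r, s))  # nearest other-class sample M
        for a in range(n_features):
            weights[a] += (feature_diff(a, r, miss, ranges)
                           - feature_diff(a, r, hit, ranges)) / m
    return weights
```

On a toy set where the first feature separates the classes and the second does not, the second weight comes out negative, which illustrates why a non-negativity correction is needed before the weights are used in a distance.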
Further, the improved Relief method comprises:
randomly extracting a sample R from a segmentation node sample set of the K-D tree each time;
based on a multi-class nearest neighbor set fast acquisition algorithm of the K-D tree, finding out a nearest neighbor sample H of R from a segmented node sample set of the K-D tree of the same class;
based on a multi-class nearest neighbor set fast acquisition algorithm of the K-D tree, finding out nearest neighbor samples M from segmentation node sample sets of the K-D trees of different classes of the R;
when updating the weights, if the distance between R and H on a certain feature is smaller than the distance between R and M, increasing the weight of the feature, and otherwise decreasing it; repeating the above process m times to finally obtain the weight of each feature;
and optimizing the obtained weight of each feature.
It should be understood that the range from which the sample R is drawn in the Relief algorithm is changed from the training sample set to the segmentation node sample set of the K-D tree, which makes the distribution of the drawn samples more reasonable; likewise, the same-class and different-class nearest neighbor samples H and M of R are searched for using the segmentation node sample set of the K-D tree.
It should be understood that the K-D tree, which is a fast-indexing binary tree data structure, is a partition of the K-dimensional space, and has the advantage of fast indexing data.
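To make the data structure concrete, here is a minimal sketch of a standard K-D tree with a single nearest-neighbor query. It is the textbook structure only (median split, cycling axes, squared Euclidean distance), not the multi-class acquisition algorithm of this method; the dictionary node layout and function names are choices made for this example.

```python
def build_kdtree(points, depth=0):
    """Recursively split points at the median along one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],  # the split (segmentation) node sample
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(node, target, best=None):
    """Return (squared_distance, point) of the stored point closest to target."""
    if node is None:
        return best
    d = sq_dist(node["point"], target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    delta = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if delta < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if delta * delta < best[0]:
        best = nearest(far, target, best)
    return best
```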
It should be understood that the multi-class nearest neighbor set fast acquisition algorithm based on the K-D tree optimizes the space-time overhead of the acquisition of the nearest neighbor sample set;
further, the multi-class nearest neighbor set fast acquiring algorithm based on the K-D tree includes:
first, defining the input variables: R, the behavior data to be detected; K-DTree, the K-D tree; C, the category set; l, the number of nearest neighbors; D, the backtracking threshold; and defining the output variable S_Ci, the set of nearest neighbor samples of R that belong to class Ci;
second, taking the set of segmentation sample nodes of K-DTree as the set S_DN; then randomly extracting a sample node R* from S_DN and backtracking upward layer by layer along its parent path until a parent sample node R** is found such that the number of nodes in the subtree rooted at R** is greater than or equal to C.count × l (i.e. the product of the number of categories and the number of nearest neighbors); R is then set to R**;
third, for each class Ci, selecting the sample nodes in the left and right subtrees of R that belong to class Ci and adding them to S_Ci; then continuing to backtrack upward from R to retrieve nodes that belong to class Ci and are closer to R, which replace nodes already in S_Ci, until one of the following conditions is satisfied: (1) the root node ROOT is reached; (2) the backtracking depth reaches the backtracking threshold D; (3) the set S_Ci no longer changes;
fourth, ending the algorithm and returning each set S_Ci.
Further, the optimizing the obtained weight of each feature includes:
the weight of the feature A in the Relief algorithm is updated as follows:
$$w'(A) = w(A) - \Delta w_N(A) + \Delta w_P(A)$$
where w(A) is the weight of the feature A before the update and w'(A) is the weight of the feature A after the update;
Δw_N(A) is the negative weight increment:
[Equation image not reproduced in this text.]
Δw_P(A) is the positive weight increment:
[Equation image not reproduced in this text.]
where diff(A, R_1, R_2) is:
[Equation image not reproduced in this text.]
Because w(A) may be negative, which is incompatible with the weighted Mahalanobis distance calculation, w(A) is made non-negative after the m rounds of calculation are complete, as follows:
$$w(A) = z + w(A) + \varepsilon$$
where
$$z = \left| \min\{\, w(X) : X \in S \,\} \right|$$
Here S is the set of features, z is the absolute value of the minimum feature weight produced by the calculation, and ε is a weight offset compensation, taken as a constant.
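The non-negative correction is a one-line shift, sketched below. The function name and the default value of ε are assumptions made for illustration; the text only states that ε is a constant.

```python
def shift_non_negative(weights, epsilon=1e-6):
    """Apply w(A) = z + w(A) + epsilon, with z = |min over all feature weights|."""
    z = abs(min(weights))
    return [z + w + epsilon for w in weights]
```

Because the same offset is added to every weight, the ordering of the features is preserved, and the smallest weight ends up at least ε above zero.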
At the same time, to prevent an excessively large value in the set from making the other values indistinguishable, diff(A, R_1, R_2) is modified to a Z-score standardized form:
[Equation image not reproduced in this text.]
as one or more embodiments, the S104: adopting a Spark engine to calculate the distance between the behavior data to be detected and all data in the calibration behavior data set in a parallel weighting manner; the method comprises the following specific steps:
the Executors of each Worker node calculate, in parallel, the weighted Mahalanobis distance between the behavior data to be detected and all the data in the calibration behavior data set;
the Mahalanobis distance model is a distance model between vectors; for the vectors $\vec{x}$ and $\vec{y}$ corresponding to the samples X and Y, the Mahalanobis distance is:
$$d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^{T} S^{-1} (\vec{x} - \vec{y})}$$
where S is the covariance matrix of $\vec{x}$ and $\vec{y}$.
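The plain (unweighted) Mahalanobis distance can be sketched with the standard definition, d = sqrt((x - y)^T S^{-1} (x - y)). The sketch below estimates S from a data set with the sample covariance and solves a linear system instead of forming the inverse; both are implementation choices of this example, and the weighted variant used by the method is not reproduced here.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T cov^{-1} (x - y)), computed via a linear solve."""
    d = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, d)  # z = cov^{-1} d
    return math.sqrt(sum(di * zi for di, zi in zip(d, z)))
```

With the identity matrix as S the result reduces to the Euclidean distance. Note that a full set of one-hot columns is linearly dependent, so a covariance matrix estimated over such columns is singular and this sketch would raise an error; how the method handles that case is not stated in the text.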
Further, the detection method adopts a weighted Mahalanobis distance model, which improves on the Mahalanobis distance model: first the feature weight vector $\vec{w}$ of the calibration behavior data set is calculated, and then the weighted Mahalanobis distance is calculated:
[Equation image not reproduced in this text.]
As one or more embodiments, the S105: sorting the data according to the distance between the behavior data to be detected and the calibration behavior data from small to large, selecting the first K calibration behavior data to perform Gaussian weight weighted voting, and obtaining the category of the behavior data to be detected; the method comprises the following specific steps:
constructing a class-free nearest neighbor set rapid acquisition algorithm based on the K-D tree of S103, and using it to rapidly acquire the first K nearest neighbors of the behavior data to be detected in the calibration behavior data set;
weighting the K calibration behavior data one by adopting a Gaussian function according to the distance between the behavior data to be detected and the K calibration behavior data;
voting is carried out according to the weight and the category mark of the K pieces of calibration behavior data.
Further, the K-D tree-based class-free nearest neighbor set rapid acquisition algorithm comprises:
first, defining the input variables: K-DTree, the K-D tree; U, the data to be detected; K, the number of nearest neighbors; D, the backtracking threshold; and defining the output variable Node[h], the neighbor node set;
second, finding the nearest neighbor point N of U on K-DTree by binary tree search; if the left or right subtree space of N contains a sample node N* that is closer than N, stopping the search and adding N* to Node[h]; otherwise adding N to Node[h];
third, backtracking upward by setting N to its parent sample node, and repeating the second step until the backtracking depth reaches the given threshold D;
and fourth, outputting Node[h] and ending the algorithm.
Weighting by a Gaussian function: the calibration behavior data are weighted with a Gaussian function; for the i-th calibration behavior data N_i, the weight is calculated as follows:
$$w_i = a \, e^{-\frac{(d_i - b)^2}{2c^2}}$$
where d_i is the distance between the neighbor sample N_i and the sample to be classified; for the purpose of calculating voting weights, a is set to 1, b is set to 0, and c is an adjustable parameter.
Voting: the category of the behavior data to be detected is decided by weighted category voting, calculated as follows:
$$C = \arg\max_{C_j,\; j = 1, \dots, L} \sum_{i=1}^{K} w_i f_{ij}$$
where
$$f_{ij} = \begin{cases} 1, & N_i \in C_j \\ 0, & \text{otherwise} \end{cases}$$
Here K is the number of neighbors, L is the number of categories, C_j is the j-th category, and f_ij indicates category membership.
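The sort, select, weight, and vote sequence of S105 can be sketched as below. The sketch assumes the standard Gaussian form a·exp(-(d - b)² / (2c²)) with a = 1 and b = 0 as stated above, takes precomputed (distance, label) pairs as input, and uses a plain sort rather than the K-D tree search; the function names and the label strings are illustrative.

```python
import math
from collections import defaultdict

def gaussian_weight(d, a=1.0, b=0.0, c=1.0):
    """Gaussian voting weight for a neighbor at distance d."""
    return a * math.exp(-((d - b) ** 2) / (2 * c * c))

def weighted_vote(neighbors, k, c=1.0):
    """Pick the label with the largest summed Gaussian weight among the k nearest.

    neighbors is a list of (distance, label) pairs for the calibration data.
    """
    nearest = sorted(neighbors, key=lambda n: n[0])[:k]
    scores = defaultdict(float)
    for d, label in nearest:
        scores[label] += gaussian_weight(d, c=c)
    return max(scores, key=scores.get)
```

With neighbors at distances 0.1 ("lease"), 2.0 and 2.1 (both "normal") and K = 3, a simple majority would choose "normal", whereas the Gaussian weighting favors the single very close "lease" neighbor. The parameter c controls how quickly a neighbor's influence decays with distance.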
Example two
The embodiment provides a campus card leasing behavior detection system based on Spark;
campus card lease behavior detection system based on Spark includes:
a data acquisition module configured to: acquiring use data of a campus card by a user, and taking the acquired data as data to be detected; acquiring manually screened use data of users marked as leases on the campus card, and taking the acquired data as calibration data; converting the data to be detected into a behavior data set to be detected, and converting the calibration data into a calibration behavior data set;
a data pre-processing module configured to: respectively carrying out quantitative processing on category characteristics in the behavior data set to be detected and the calibration behavior data set; respectively carrying out standardized processing on all characteristics in the behavior data set to be detected and the calibration behavior data set;
a weight calculation module configured to: calculating the weight of each characteristic in the calibration behavior data set in parallel by adopting a Spark engine;
a distance calculation module configured to: adopting a Spark engine to calculate the distance between the behavior data to be detected and all data in the calibration behavior data set in a parallel weighting manner;
a voting module configured to: and sequencing the data according to the distance between the behavior data to be detected and the calibration behavior data from small to large, and selecting the first K calibration behavior data to perform Gaussian weight weighted voting to obtain the category of the behavior data to be detected.
It should be noted here that the data acquisition module, the data preprocessing module, the weight calculation module, the distance calculation module, and the voting module correspond to steps S101 to S105 in the first embodiment, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
Example three
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, or registers. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiment also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (9)

1. A campus card leasing behavior detection method based on Spark is characterized by comprising the following steps:
acquiring use data of a campus card by a user, and taking the acquired data as data to be detected;
acquiring manually screened campus card usage data of users marked as leasing, and taking the acquired data as calibration data; converting the data to be detected into a behavior data set to be detected, and converting the calibration data into a calibration behavior data set;
respectively quantizing the category features in the behavior data set to be detected and in the calibration behavior data set; respectively standardizing all features in the behavior data set to be detected and in the calibration behavior data set;
calculating the weight of each feature in the calibration behavior data set in parallel using a Spark engine;
using the Spark engine to calculate, in parallel, the weighted distance between the behavior data to be detected and all data in the calibration behavior data set;
sorting the data according to the distance between the behavior data to be detected and the calibration behavior data from small to large, selecting the first K calibration behavior data to perform Gaussian weight weighted voting, and obtaining the category of the behavior data to be detected;
wherein sorting the data by distance, selecting the first K calibration behavior data, and performing Gaussian-weighted voting specifically comprises:
rapidly acquiring the K nearest neighbors of the behavior data to be detected in the calibration behavior data set, using a class-free nearest neighbor set fast acquisition algorithm based on a K-D tree;
weighting the K calibration behavior data one by one with a Gaussian function according to the distance between the behavior data to be detected and each of the K calibration behavior data;
voting according to the weights and the category labels of the K calibration behavior data;
wherein the class-free nearest neighbor set fast acquisition algorithm based on the K-D tree comprises:
first, establishing input variables K-DTree: a K-D tree; U: the data to be detected; K: the number of nearest neighbors; D: a backtracking threshold; and establishing an output variable Node[h] as the neighbor node set;
second, finding the nearest neighbor point N of U on the K-D tree through binary tree search; if a sample node N* closer than N exists in the left and right subtree spaces of N, stopping the search and adding N* to Node[h]; otherwise, adding N to Node[h];
third, backtracking upwards, setting N to the parent sample node of N, and repeating the second step until the backtracking depth reaches the given threshold D;
fourth, outputting Node[h] and ending the algorithm;
the weighting with a Gaussian function means that the calibration behavior data are weighted with a Gaussian function, the weight of the i-th calibration behavior data N_i being calculated as follows:
Figure FDA0003567658220000021
wherein d_i is the distance between the behavior data to be detected and the neighbor sample N_i; for the calculation of the voting weight, a is set to 1, b is set to 0, and c is an adjustable parameter;
the voting means that the category of the behavior data to be detected is decided by weighted category voting, calculated as follows:
Figure FDA0003567658220000022
wherein
Figure FDA0003567658220000023
and K is the number of neighbors, L is the number of categories, C_j is the j-th category, and f_ij indicates category membership.
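The neighbor search, Gaussian weighting, and weighted vote of claim 1 can be illustrated with a minimal Python sketch. The equation images are not reproduced in this text, so the sketch assumes the standard three-parameter Gaussian, w_i = a * exp(-(d_i - b)^2 / (2 * c^2)) with a = 1 and b = 0, and a vote that sums w_i over the neighbors of each category and picks the category with the largest total. An exhaustive sorted search stands in for the K-D tree algorithm, the weighted Euclidean distance is only one plausible form of the weighted distance, and the names `weighted_distance`, `gaussian_weight`, and `classify`, as well as the sample data, are illustrative assumptions.

```python
import math
from collections import defaultdict


def weighted_distance(u, v, feature_weights):
    """Weighted Euclidean distance (assumed form; weights must be non-negative)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(feature_weights, u, v)))


def gaussian_weight(d, a=1.0, b=0.0, c=1.0):
    """Standard Gaussian function; a = 1 and b = 0 as in the claim, c is tunable."""
    return a * math.exp(-((d - b) ** 2) / (2.0 * c ** 2))


def classify(u, calibration, feature_weights, k=3, c=1.0):
    """Vote over the k nearest calibration samples.

    calibration is a list of (feature_vector, label) pairs. The exhaustive
    sort below is a stand-in for the K-D tree neighbor search.
    """
    ranked = sorted(
        (weighted_distance(u, x, feature_weights), label) for x, label in calibration
    )
    votes = defaultdict(float)
    for d, label in ranked[:k]:
        votes[label] += gaussian_weight(d, c=c)  # closer neighbors count more
    return max(votes, key=votes.get)


# Hypothetical, already standardized calibration samples.
calibration = [
    ((0.0, 0.0), "normal"),
    ((0.1, 0.2), "normal"),
    ((1.0, 1.0), "lease"),
    ((0.9, 1.1), "lease"),
    ((1.2, 0.8), "lease"),
]
predicted = classify((0.95, 1.0), calibration, feature_weights=(1.0, 1.0), k=3)
```

With K = 3 the three nearest calibration samples in this toy data all carry the lease label, so the vote returns that category. A smaller c makes distant neighbors count less.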
2. The method for detecting campus card leasing behavior according to claim 1, wherein
converting the data to be detected into a behavior data set to be detected specifically comprises:
the data to be detected includes: account number, student number, name, gender, college, identity type, transaction amount, transaction merchant, and transaction time;
performing feature extraction on the data to be detected to obtain the features of the data to be detected; the features of the data to be detected comprise: gender, identity, whether the user is in a graduating class, total consumption amount, total number of consumption transactions, catering consumption amount, bathing consumption proportion, fitness consumption proportion, whether learning related technology exists, and whether medical related records exist;
and storing the features of the data to be detected by user number to obtain the behavior data set to be detected.
3. The method for detecting campus card leasing behavior according to claim 1, wherein
converting the calibration data into a calibration behavior data set specifically comprises:
the calibration data includes: account number, student number, name, gender, college, identity type, transaction amount, transaction merchant, transaction time, and whether leasing behavior is present;
performing feature extraction on the calibration data to obtain the features of the calibration data; the features of the calibration data comprise: gender, identity, whether the user is in a graduating class, total consumption amount, total number of consumption transactions, catering consumption amount, bathing consumption proportion, fitness consumption proportion, whether learning related technology exists, whether medical related records exist, and a label indicating whether leasing behavior exists;
and storing the features of the calibration data by user number to obtain the calibration behavior data set.
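The conversion from raw transactions to a per-user behavior data set, followed by quantization of a category feature and standardization, might look like the following Python sketch. Only a subset of the listed features is computed. The record layout (`user`, `gender`, `merchant_type`, `amount`), the merchant categories, the 0/1 encoding of gender, and the use of z-score standardization are all assumptions made for illustration, since the claims do not fix them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def extract_features(transactions):
    """Aggregate raw card transactions into one feature row per user number."""
    by_user = defaultdict(list)
    for t in transactions:
        by_user[t["user"]].append(t)

    rows = {}
    for user, items in by_user.items():
        total = sum(t["amount"] for t in items)

        def spent(kind):
            # Amount spent by this user at merchants of the given type.
            return sum(t["amount"] for t in items if t["merchant_type"] == kind)

        rows[user] = {
            "gender": items[0]["gender"],
            "total_amount": total,
            "total_count": len(items),
            "catering_amount": spent("catering"),
            "bath_ratio": spent("bath") / total if total else 0.0,
            "fitness_ratio": spent("fitness") / total if total else 0.0,
        }
    return rows


def quantify(rows, column, mapping):
    """Replace a category value with a number (assumed encoding)."""
    for row in rows.values():
        row[column] = mapping[row[column]]


def standardize(rows, column):
    """Z-score standardization of one numeric column (assumed method)."""
    values = [row[column] for row in rows.values()]
    mu, sigma = mean(values), pstdev(values)
    for row in rows.values():
        row[column] = (row[column] - mu) / sigma if sigma else 0.0


# Hypothetical transaction records.
transactions = [
    {"user": "u1", "gender": "F", "merchant_type": "catering", "amount": 12.0},
    {"user": "u1", "gender": "F", "merchant_type": "bath", "amount": 4.0},
    {"user": "u2", "gender": "M", "merchant_type": "catering", "amount": 30.0},
    {"user": "u2", "gender": "M", "merchant_type": "fitness", "amount": 10.0},
]
features = extract_features(transactions)
quantify(features, "gender", {"F": 0, "M": 1})
standardize(features, "total_amount")
```

The calibration data set would pass through the same steps, with the leasing label carried along as an extra column.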
4. The method for detecting campus card leasing behavior according to claim 1, wherein
calculating the weight of each feature in the calibration behavior data set in parallel using a Spark engine specifically comprises:
the Driver of the Spark computing engine stores the calibration behavior data set on a distributed file system (HDFS) and then converts the calibration behavior data set into a resilient distributed dataset (RDD);
each Worker node is divided into a plurality of Executors according to its resources, and an improved Relief method is run in parallel on each Executor to calculate the weight of each feature in the calibration behavior data set, obtaining the feature weight values on the current Executor;
and the Driver averages the feature weights obtained from the Executors and sets the average as the feature weight value.
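The Driver and Executor pattern of claim 4 (compute weights independently on each data partition, then average them) can be imitated locally in plain Python. In the sketch below a thread pool stands in for the Executors and a round-robin split stands in for RDD partitioning; in PySpark the same shape would typically be expressed with `mapPartitions` followed by an average on the driver. The function `mean_gap_weights` is a deliberately simple placeholder scorer and not the improved Relief method; it and all other names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, n_parts):
    """Round-robin split, standing in for RDD partitioning."""
    return [rows[i::n_parts] for i in range(n_parts)]


def mean_gap_weights(rows):
    """Placeholder scorer (not Relief): per-feature gap between the two class means.

    rows is a list of (feature_vector, label) pairs with labels 0 and 1;
    each partition is assumed to contain both labels.
    """
    n_features = len(rows[0][0])
    weights = []
    for j in range(n_features):
        pos = [x[j] for x, y in rows if y == 1]
        neg = [x[j] for x, y in rows if y == 0]
        weights.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return weights


def average_weights(rows, weight_fn, n_parts=2):
    """Run weight_fn on every partition in parallel and average the results."""
    parts = split(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        per_part = list(pool.map(weight_fn, parts))  # one weight vector per "Executor"
    n_features = len(per_part[0])
    # "Driver" step: element-wise mean of the per-partition weight vectors.
    return [sum(w[j] for w in per_part) / len(per_part) for j in range(n_features)]


# Hypothetical calibration rows: (features, label).
rows = [
    ((0.0, 0.5), 0),
    ((0.2, 0.4), 0),
    ((1.0, 0.5), 1),
    ((0.8, 0.6), 1),
]
averaged = average_weights(rows, mean_gap_weights, n_parts=2)
```

Averaging per-partition weights is cheap to combine, but each partition sees only part of the data, so the result can differ from weights computed on the full set.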
5. The method for detecting campus card leasing behavior according to claim 4, wherein
the improved Relief method comprises:
step one: randomly extracting a sample R from the segmentation node sample set of the K-D tree each time;
step two: using a multi-class nearest neighbor set fast acquisition algorithm based on the K-D tree, finding the nearest neighbor sample H of R among the segmentation node samples of the K-D tree that belong to the same class as R;
step three: using the multi-class nearest neighbor set fast acquisition algorithm based on the K-D tree, finding the nearest neighbor sample M of R among the segmentation node samples of the K-D tree that belong to classes different from that of R;
step four: updating the weights: if, on a given feature, the distance between R and H is smaller than the distance between R and M, the weight of that feature is increased; otherwise, the weight of that feature is decreased;
repeating step one to step four m times to finally obtain the weight of each feature;
and optimizing the obtained weight of each feature.
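Steps one to four follow the shape of the classic Relief update, which a short Python sketch can make concrete. This is the basic two-class Relief rule, not the improved variant claimed here: the nearest hit H and nearest miss M are found by exhaustive search rather than through the K-D tree, the final weight optimization step is omitted, and features are assumed to be scaled to [0, 1]. The function name, the seed parameter, and the toy data are illustrative assumptions.

```python
import random


def relief(samples, labels, m, seed=0):
    """Basic two-class Relief feature weighting (features assumed in [0, 1])."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def dist(a, b):
        # Squared Euclidean distance; only the ordering matters here.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(m):
        i = rng.randrange(len(samples))  # step one: random sample R
        r, cls = samples[i], labels[i]
        hits = [s for j, s in enumerate(samples) if labels[j] == cls and j != i]
        misses = [s for j, s in enumerate(samples) if labels[j] != cls]
        h = min(hits, key=lambda s: dist(r, s))       # step two: nearest hit H
        miss = min(misses, key=lambda s: dist(r, s))  # step three: nearest miss M
        for f in range(n_features):
            # Step four: the weight rises when R is closer to H than to M
            # on this feature, and falls otherwise.
            weights[f] += (abs(r[f] - miss[f]) - abs(r[f] - h[f])) / m
    return weights


# Toy data: feature 0 separates the classes, feature 1 is noise.
samples = [(0.0, 0.3), (0.1, 0.9), (0.2, 0.5), (0.8, 0.4), (0.9, 0.8), (1.0, 0.2)]
labels = [0, 0, 0, 1, 1, 1]
weights = relief(samples, labels, m=20)
```

On this toy data the separating feature ends up with the larger weight, which is the behavior the distance weighting in claim 1 relies on.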
6. The method for detecting campus card leasing behavior according to claim 5, wherein
the multi-class nearest neighbor set fast acquisition algorithm based on the K-D tree comprises:
first, establishing input variables R: the behavior data to be detected; K-DTree: a K-D tree; C: a category set; l: the number of nearest neighbors; D: a backtracking threshold; and establishing an output variable S_Ci: the nearest neighbor samples of R belonging to class C_i;
second, taking the segmentation sample node set of the K-DTree and adding it to the set S_DN; then randomly extracting a sample node R* from S_DN and backtracking upwards layer by layer along the parent path until a parent sample node R** is found such that the number of nodes in the subtree rooted at that parent sample node is greater than or equal to C.count × l; R is taken as R**;
third, for each class C_i, selecting from the left and right subtrees of R the sample nodes belonging to class C_i and adding them to S_Ci; continuing to backtrack upwards from R to retrieve nodes that belong to class C_i and are closer, and adding them to S_Ci as replacements, until one of the following conditions is satisfied: (1) the root node ROOT is reached; (2) the backtracking reaches the backtracking threshold D; (3) the set S_Ci no longer changes;
fourth, ending the algorithm and returning each set S_Ci.
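The output of this algorithm, a set S_Ci of nearest neighbors for each class C_i, can be checked against an exhaustive reference. The Python sketch below is such a reference only: it scans every sample instead of walking a K-D tree with a backtracking threshold, so it returns exact neighbors where the tree-based procedure may return approximate ones. The function name and data are hypothetical.

```python
import heapq
from collections import defaultdict


def nearest_per_class(r, samples, labels, l):
    """Return, for each class, the l samples closest to r (exhaustive search)."""
    grouped = defaultdict(list)
    for s, c in zip(samples, labels):
        grouped[c].append(s)

    def dist(s):
        # Squared Euclidean distance from r; sufficient for ranking.
        return sum((a - b) ** 2 for a, b in zip(r, s))

    return {c: heapq.nsmallest(l, members, key=dist) for c, members in grouped.items()}


samples = [(0.0, 0.3), (0.1, 0.9), (0.4, 0.5), (0.8, 0.4), (0.9, 0.8), (0.6, 0.5)]
labels = [0, 0, 0, 1, 1, 1]
neighbors = nearest_per_class((0.5, 0.5), samples, labels, l=1)
```

Comparing the tree-based sets against this reference on a small sample is one way to judge how much accuracy a given backtracking threshold D gives up.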
7. A campus card leasing behavior detection system based on Spark, characterized by comprising:
a data acquisition module configured to: acquire use data of a campus card by a user, and take the acquired data as data to be detected; acquire manually screened campus card usage data of users marked as leasing, and take the acquired data as calibration data; convert the data to be detected into a behavior data set to be detected, and convert the calibration data into a calibration behavior data set;
a data pre-processing module configured to: respectively quantize the category features in the behavior data set to be detected and in the calibration behavior data set; respectively standardize all features in the behavior data set to be detected and in the calibration behavior data set;
a weight calculation module configured to: calculate the weight of each feature in the calibration behavior data set in parallel using a Spark engine;
a distance calculation module configured to: use the Spark engine to calculate, in parallel, the weighted distance between the behavior data to be detected and all data in the calibration behavior data set;
a voting module configured to: sorting the data according to the distance between the behavior data to be detected and the calibration behavior data from small to large, selecting the first K calibration behavior data to perform Gaussian weight weighted voting, and obtaining the category of the behavior data to be detected;
wherein sorting the data by distance, selecting the first K calibration behavior data, and performing Gaussian-weighted voting specifically comprises:
rapidly acquiring the K nearest neighbors of the behavior data to be detected in the calibration behavior data set, using a class-free nearest neighbor set fast acquisition algorithm based on a K-D tree;
weighting the K calibration behavior data one by one with a Gaussian function according to the distance between the behavior data to be detected and each of the K calibration behavior data;
voting according to the weights and the category labels of the K calibration behavior data;
wherein the class-free nearest neighbor set fast acquisition algorithm based on the K-D tree comprises:
first, establishing input variables K-DTree: a K-D tree; U: the data to be detected; K: the number of nearest neighbors; D: a backtracking threshold; and establishing an output variable Node[h] as the neighbor node set;
second, finding the nearest neighbor point N of U on the K-D tree through binary tree search; if a sample node N* closer than N exists in the left and right subtree spaces of N, stopping the search and adding N* to Node[h]; otherwise, adding N to Node[h];
third, backtracking upwards, setting N to the parent sample node of N, and repeating the second step until the backtracking depth reaches the given threshold D;
fourth, outputting Node[h] and ending the algorithm;
the weighting with a Gaussian function means that the calibration behavior data are weighted with a Gaussian function, the weight of the i-th calibration behavior data N_i being calculated as follows:
Figure FDA0003567658220000061
wherein d_i is the distance between the behavior data to be detected and the neighbor sample N_i; for the calculation of the voting weight, a is set to 1, b is set to 0, and c is an adjustable parameter;
the voting means that the category of the behavior data to be detected is decided by weighted category voting, calculated as follows:
Figure FDA0003567658220000062
wherein
Figure FDA0003567658220000063
and K is the number of neighbors, L is the number of categories, C_j is the j-th category, and f_ij indicates category membership.
8. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-6.
9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 6.
CN202011553092.8A 2020-12-24 2020-12-24 Campus card leasing behavior detection method and system based on Spark Active CN112667709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011553092.8A CN112667709B (en) 2020-12-24 2020-12-24 Campus card leasing behavior detection method and system based on Spark

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011553092.8A CN112667709B (en) 2020-12-24 2020-12-24 Campus card leasing behavior detection method and system based on Spark

Publications (2)

Publication Number Publication Date
CN112667709A CN112667709A (en) 2021-04-16
CN112667709B true CN112667709B (en) 2022-05-03

Family

ID=75408546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011553092.8A Active CN112667709B (en) 2020-12-24 2020-12-24 Campus card leasing behavior detection method and system based on Spark

Country Status (1)

Country Link
CN (1) CN112667709B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948705A (en) * 2019-03-20 2019-06-28 武汉大学 A kind of rare class detection method and device based on k neighbour's figure
WO2019233189A1 (en) * 2018-06-04 2019-12-12 江南大学 Method for detecting sensor network abnormal data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8489596B1 (en) * 2013-01-04 2013-07-16 PlaceIQ, Inc. Apparatus and method for profiling users
US20170193372A1 (en) * 2016-01-06 2017-07-06 The Boeing Company Health Management Using Distances for Segmented Time Series
CN107590218B (en) * 2017-09-01 2020-11-06 南京理工大学 Spark-based multi-feature combined Chinese text efficient clustering method
CN108921188B (en) * 2018-05-23 2020-11-17 重庆邮电大学 Parallel CRF method based on Spark big data platform
CN111177301B (en) * 2019-11-26 2023-05-26 云南电网有限责任公司昆明供电局 Method and system for identifying and extracting key information
CN111626821B (en) * 2020-05-26 2024-03-12 山东大学 Product recommendation method and system for realizing customer classification based on integrated feature selection

Also Published As

Publication number Publication date
CN112667709A (en) 2021-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant