WO2019200739A1 - Data fraud identification method, apparatus, computer device, and storage medium - Google Patents

Data fraud identification method, apparatus, computer device, and storage medium

Info

Publication number
WO2019200739A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature data
feature
irrelevant
fraud
Prior art date
Application number
PCT/CN2018/095389
Other languages
French (fr)
Chinese (zh)
Inventor
王义文
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019200739A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 20/00: Payment architectures, schemes or protocols
    • G06Q 20/38: Payment protocols; details thereof
    • G06Q 20/382: Payment protocols; details thereof insuring higher security of transaction
    • G06Q 20/3829: Payment protocols; details thereof insuring higher security of transaction involving key management

Definitions

  • the present application relates to the field of data fraud identification, and in particular to a data fraud identification method, apparatus, computer device and storage medium.
  • Blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and is not subject to control by any single party. Data on a blockchain cannot be falsified after the fact, but fraudulent data created by malicious fake orders ("order brushing") can still exist. How to determine whether the data on a blockchain was generated by normal transactions or was "brushed" is an urgent problem to be solved.
  • the main purpose of the present application is to provide a data fraud identification method, apparatus, computer device and storage medium that can effectively identify fraudulent data on a blockchain.
  • The present application provides a data fraud identification method for data fraud identification on a blockchain, the method comprising: acquiring data related to a specified enterprise on the blockchain; performing feature extraction on the acquired data to obtain a plurality of feature data; extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data; and
  • performing outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
  • the application also provides a data fraud identification device for data fraud identification on a blockchain, the device comprising:
  • An obtaining unit configured to acquire data related to a specified enterprise on a blockchain
  • a feature extraction unit configured to perform feature extraction on the acquired data to obtain a plurality of feature data
  • An uncorrelated analysis unit configured to extract, as the unrelated feature data, feature data that is not related to other feature data in the plurality of feature data;
  • the abnormality identifying unit is configured to perform an outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
  • the application further provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, the processor executing the computer readable instructions to implement the steps of any of the methods described above.
  • The present application also provides a non-transitory computer-readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the steps of any of the methods described above.
  • The data fraud identification method, apparatus, computer device and storage medium of the present application address for the first time the problem of identifying fraudulent data on an enterprise blockchain. Using the Voronoi algorithm, data that may be fraudulent can be singled out, so that an enterprise can assess whether the people or companies it does business with may be engaging in fraud and then choose an appropriate degree of cooperation.
  • FIG. 1 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application.
  • FIG. 2 is a Voronoi diagram according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic block diagram showing the structure of an abnormality identifying unit according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to another embodiment of the present application.
  • FIG. 8 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application.
  • FIG. 10 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application.
  • FIG. 11 is a schematic block diagram showing the structure of a computer device according to an embodiment of the present application.
  • an embodiment of the present application provides a data fraud identification method for data fraud identification on a blockchain, where the method includes the steps of: S1: acquiring data related to a designated enterprise on a blockchain;
  • the above blockchain is a decentralized, trust-free new data architecture, which is jointly owned, managed, and supervised by all nodes in the network, and does not accept single-party control.
  • a blockchain is a distributed ledger database that manages a growing number of transaction records that are organized into blocks and protected against tampering.
  • A distributed ledger refers to a digital record of ownership that differs from traditional database technology in that no central administrator or central data store is required; the ledger can be replicated across the nodes of a peer-to-peer network, and each transaction is signed with a private key.
  • a consensus mechanism is a mechanism for the application of blockchain or distributed ledger technology that does not rely on a central authority to identify and verify a value or transaction.
  • the consensus mechanism is the basis for all blockchain and distributed ledger applications.
  • the above designated enterprise refers to the enterprise to be queried, that is, the enterprise to be queried whether there is fraudulent data on the blockchain.
  • the above related data is generally all data related to the designated enterprise in the blockchain, such as account, amount, date, time, currency, channel, merchant, product information, user IP, device, etc. related to the designated enterprise.
  • the specific method for obtaining the specified data includes inputting a keyword such as a company name and a business scope, and then performing a search on the blockchain to obtain all data related to the search term.
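  • As a rough illustration of this retrieval step (not part of the patent text, and no real blockchain client API is implied), records already exported from the chain could be filtered by keyword as follows:

```python
# Illustrative sketch only: `ledger` is assumed to be an iterable of dict records
# already read off the blockchain; the field names are hypothetical.
def search_records(ledger, keywords):
    keywords = [k.lower() for k in keywords]
    return [rec for rec in ledger
            if any(k in str(v).lower() for v in rec.values() for k in keywords)]

hits = search_records(
    ledger=[{"account": "Y-001", "merchant": "Enterprise Y", "amount": 120}],
    keywords=["Enterprise Y"])
```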
  • As described in step S2 above, this is the feature extraction process. In one embodiment, the specific process includes: integrating the data and normalizing it into a data set; formatting the data in the data set and cleaning the sampled data; and then transforming the sampled data to obtain the required feature data.
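  • A minimal sketch of that preprocessing chain (my own illustration; the column names and the use of pandas are assumptions, not the patent's code):

```python
# Integrate several exported record sets, format and clean them, then transform
# the result into the numeric feature data used by the later steps.
import pandas as pd

def preprocess(raw_frames: list) -> pd.DataFrame:
    data = pd.concat(raw_frames, ignore_index=True)               # integrate into one data set
    data["date"] = pd.to_datetime(data["date"], errors="coerce")  # format
    data = data.dropna(subset=["account", "amount"])               # clean sampled data
    data["amount_z"] = (data["amount"] - data["amount"].mean()) / data["amount"].std()
    return data                                                    # transformed feature data
```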
  • the feature extraction can use the ReliefF algorithm.
  • The ReliefF algorithm was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (the Relief algorithm is a feature weighting algorithm that assigns different weights to features according to the correlation between each feature and the class; features whose weights fall below a certain threshold are removed). Compared with the Relief algorithm, ReliefF can handle multi-class problems.
  • the ReliefF algorithm is used to process regression problems where the target attribute is a continuous value.
  • When handling multi-class problems, the ReliefF algorithm randomly draws a sample R from the training set, finds R's k nearest-neighbor samples of the same class (near hits H_j) and, for each class different from R's, k nearest-neighbor samples (near misses M_j(C)), and then updates the weight of each feature A as follows:

    W(A) = W(A) - \sum_{j=1}^{k} diff(A, R, H_j)/(m \cdot k) + \sum_{C \neq class(R)} [ p(C)/(1 - p(class(R))) ] \cdot \sum_{j=1}^{k} diff(A, R, M_j(C))/(m \cdot k)

  • diff(A, R_1, R_2) represents the difference between sample R_1 and sample R_2 on feature A; for a numeric feature it is computed as diff(A, R_1, R_2) = |R_1[A] - R_2[A]| / (max(A) - min(A)).
  • M_j(C) represents the j-th nearest-neighbor sample in class C, and m is the number of sampling iterations.
  • feature extraction is performed using the Relief algorithm described above.
  • The Relief algorithm randomly selects a sample R from the training set D, then searches for the nearest-neighbor sample H among the samples of the same class as R (called the near hit) and the nearest-neighbor sample M among the samples of a different class (called the near miss), and then updates the weight of each feature according to the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature is helpful for distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is increased; conversely, if the distance between R and the near hit is greater than the distance between R and the near miss, the feature has a negative effect on distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is decreased.
  • the above process is repeated m times, and finally the average weight of each feature is obtained.
  • The greater the weight of a feature, the stronger its ability to discriminate between classes; conversely, a smaller weight indicates a weaker discriminative ability.
  • the running time of the Relief algorithm increases linearly with the sampling number m of samples and the number of original features N, so the operating efficiency is very high.
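  • A rough sketch (my own illustration, not the patent's code) of this Relief weight update for binary-class data with numeric features:

```python
# Relief weight update: reward features on which R is far from its nearest miss
# and close to its nearest hit, repeated over m random samples.
import numpy as np

def relief_weights(X: np.ndarray, y: np.ndarray, m: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12     # normalizes per-feature differences
    w = np.zeros(X.shape[1])
    for _ in range(m):
        i = rng.integers(len(X))
        r, label = X[i], y[i]
        same, other = X[y == label], X[y != label]
        hits = same[np.argsort(np.linalg.norm(same - r, axis=1))]
        near_hit = hits[1] if len(hits) > 1 else hits[0]   # skip R itself
        near_miss = other[np.argmin(np.linalg.norm(other - r, axis=1))]
        w += (np.abs(r - near_miss) - np.abs(r - near_hit)) / (span * m)
    return w    # larger weight = stronger ability to separate the classes
```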
  • As described in step S3 above, the goal is to find, among the feature data, the irrelevant feature data that is not correlated with the other feature data. Because irrelevant feature data is not correlated with the other feature data, its corresponding raw data may be fraudulent data.
  • The Voronoi diagram described above is also called a Thiessen polygon or Dirichlet tessellation; it consists of a set of contiguous polygons formed by the perpendicular bisectors of the straight lines connecting pairs of neighboring points. Given N distinct points in the plane, the plane is partitioned according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region.
  • the outliers can be further identified from the above irrelevant feature data by the Voronoi algorithm, and the identified outliers are considered fraud data.
  • the above Voronoi algorithm has lower complexity and faster calculation speed.
  • In a specific embodiment, all data related to enterprise Y is extracted from the blockchain, such as the account of enterprise Y, the account login times, the number of transactions, the transaction amounts, the channel, the merchant, the product information, the user IP, and the like. Feature extraction is then performed on the obtained data, and correlation analysis is performed on the extracted feature data; the feature data that is not correlated with other feature data is the irrelevant feature data, and the raw data corresponding to the irrelevant feature data may be fraudulent.
  • To determine how likely it is that the raw data corresponding to the irrelevant feature data is fraudulent, outlier identification is performed on the irrelevant feature data by the Voronoi algorithm. The Voronoi algorithm finds the outliers among the irrelevant feature data (according to the rules of the Voronoi algorithm, it computes which irrelevant feature data differ from the other irrelevant feature data) and then outputs the outliers in order, where the raw data corresponding to the first outlier output is the most likely to be fraudulent, with the likelihood decreasing in turn. In other embodiments, the order may instead be configured so that the raw data corresponding to the first outlier output is the least likely to be fraudulent, with the likelihood increasing in turn.
  • All of the outliers may be output, or only a specified number of outliers whose corresponding raw data is most likely to be fraudulent may be output.
  • the step S4 of obtaining the fraudulent data by using the Voronoi algorithm to perform the outlier identification on the irrelevant feature data includes:
  • the above irrelevant feature data is made into a Voronoi diagram of the point set S;
  • each irrelevant feature data is regarded as a point, thereby generating a point set S.
  • b. Calculate the V-anomaly factor of each point in the point set S and find the V-adjacent points of each point. Specifically: b1. For a point p_i in the point set S, determine the adjacent points of its Voronoi polygon V(p_i), calculate the average distance from p_i to its adjacent points, and use the reciprocal of this average distance to measure the degree of abnormality of p_i.
  • For any point p of the point set S, a neighboring point of p determined by an edge of V(p) is called a V-adjacent point of p, and V(p) also denotes the set of all V-adjacent points of point p.
  • The reciprocal of the average distance from point p to all of its V-adjacent points is called the V-anomaly factor of point p, denoted Vd(p):

    Vd(p) = |V(p)| / \sum_{q \in V(p)} d(p, q)

  • where |V(p)| is the number of V-adjacent points of p and d(p, q) is the distance between p and q. Vd(p) reflects the distribution density of points around point p.
  • c. Arrange the V-anomaly factors of the points from small to large.
  • d. Output the V-anomaly factor of each point and the first n points with the smallest anomaly factors; the data corresponding to these n points is judged to be at the highest risk of being fraudulent (see the sketch below), where n is a preset value and is an integer greater than zero.
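  • A minimal sketch of this V-anomaly computation, assuming the irrelevant feature data has been reduced to two-dimensional points and using scipy.spatial.Voronoi (an illustrative choice, not the patent's reference implementation):

```python
# For each point p, Vd(p) is the reciprocal of the mean distance from p to its
# Voronoi-adjacent points; a smaller Vd(p) means a sparser neighborhood, i.e. a
# stronger outlier candidate.
import numpy as np
from scipy.spatial import Voronoi

def v_anomaly_factors(points: np.ndarray) -> np.ndarray:
    vor = Voronoi(points)
    neighbors = {i: set() for i in range(len(points))}
    for a, b in vor.ridge_points:            # each ridge joins two adjacent cells
        neighbors[a].add(b)
        neighbors[b].add(a)
    vd = np.empty(len(points))
    for i, adj in neighbors.items():
        dists = np.linalg.norm(points[list(adj)] - points[i], axis=1)
        vd[i] = len(dists) / dists.sum()      # reciprocal of the average distance
    return vd

# Usage: flag the n points with the smallest V-anomaly factors as candidate fraud.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (50, 2)), [[8.0, 8.0]]])   # one obvious outlier
print(np.argsort(v_anomaly_factors(pts))[:3])
```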
  • the step S3 of extracting the feature data that is not related to the other feature data as the unrelated feature data in the plurality of feature data includes:
  • S31 Visualize the plurality of feature data, and record the feature data corresponding to the discrete points in the visualization as the irrelevant feature data.
  • The visualization processing described above refers to converting the feature data into a graphic or an image on a screen by using computer graphics and image processing techniques. Because the feature data is visualized, a person can visually identify the discrete points on the graphic or image and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
  • the step of visualizing the plurality of feature data includes: forming the plurality of feature data into a scattergram.
  • the above scatter diagram refers to the distribution of data points on the Cartesian coordinate plane in regression analysis; it is usually used to compare aggregated data across categories. The more data you have in a scatter plot, the better the comparison will be.
  • the feature data is generally a matrix.
  • a scatter plot matrix can be used to simultaneously draw a scatter plot between the variables, so that the main correlation between multiple variables can be quickly found.
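  • A minimal sketch of such a scatter-plot matrix, assuming the feature data has been loaded into a pandas DataFrame (the column names below are illustrative only):

```python
# Plot every pair of feature columns against each other; points that sit far
# from every cluster are candidate irrelevant feature data.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

features = pd.DataFrame({
    "login_hour": [9, 10, 9, 11, 3],
    "tx_count":   [5, 6, 4, 7, 90],
    "tx_amount":  [120, 150, 110, 160, 9999],
})
scatter_matrix(features, figsize=(6, 6), diagonal="hist")
plt.show()
```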
  • In another embodiment, the step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as the irrelevant feature data includes: S32. Performing correlation matrix analysis on the plurality of feature data, and extracting the feature data that is not correlated with other feature data as the irrelevant feature data.
  • the above correlation matrix is also called a correlation coefficient matrix, which is composed of correlation coefficients between columns of the matrix. That is to say, the elements of the i-th row and the j-th column of the correlation matrix are the correlation coefficients of the i-th column and the j-th column of the original matrix.
  • A covariance matrix is generally used for the analysis. The covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating a positive correlation; if they change in opposite directions, the covariance is negative, indicating a negative correlation; if the two variables are independent of each other, the covariance is 0, indicating that the two variables are unrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
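  • A hedged sketch of this screening step (the 0.2 threshold and the use of pandas are my assumptions, not values given in the application): a feature whose strongest absolute correlation with every other feature stays below the threshold is treated as irrelevant.

```python
import numpy as np
import pandas as pd

def irrelevant_features(features: pd.DataFrame, threshold: float = 0.2) -> list:
    corr = features.corr().abs()            # correlation coefficient matrix
    np.fill_diagonal(corr.values, 0.0)      # ignore each feature's self-correlation
    return [col for col in corr.columns if corr[col].max() < threshold]
```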
  • In an embodiment, the step S2 of performing feature extraction on the acquired data to obtain feature data includes: S201. Classifying the acquired data according to a preset requirement; and S202. Performing feature extraction on each type of data separately.
  • The preset requirement is a classification standard; generally, data that may be related are classified into one category. For example, data such as the account, login time, number of transactions, and transaction amount are classified together, because the correlation among these data is strong: a user logs in to the corresponding system through the account, the system records the login time, and it also records the number of transactions and the amount of each transaction or the total transaction amount, so there are associations among these data.
  • the acquired data is first classified, and then the feature extraction is performed on different types of data respectively.
  • The specific feature extraction may be performed by the Relief algorithm described above or by the ReliefF algorithm described above.
  • Because the data is classified before the features are extracted, the features of each type of data are relatively distinct and easy to extract, which improves the accuracy of the subsequent identification of fraud data (see the sketch below).
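  • A minimal sketch of steps S201/S202 (the category names and field lists are assumptions for illustration; they are not prescribed by the application):

```python
# Group the raw records into preset categories; each category's records are then
# fed to the Relief/ReliefF feature extraction separately.
CATEGORY_RULES = {
    "account_activity":    ["account", "login_time", "tx_count", "tx_amount"],
    "transaction_context": ["channel", "merchant", "product_info", "user_ip"],
}

def classify(records: list) -> dict:
    grouped = {name: [] for name in CATEGORY_RULES}
    for rec in records:                       # rec is a dict of raw fields
        for name, fields in CATEGORY_RULES.items():
            grouped[name].append({k: rec[k] for k in fields if k in rec})
    return grouped
```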
  • In an embodiment, the step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as the irrelevant feature data includes: S301. Extracting the irrelevant feature data from the feature data corresponding to each type of data; and
  • S302. Mixing the irrelevant feature data corresponding to the various types of data, performing correlation analysis on them, and recording the irrelevant feature data that has no correlation as the final irrelevant feature data.
  • the feature extraction is performed on each type of data described above.
  • For example, the features extracted from the account, login time, transaction-count and transaction-amount category are first visualized, or analyzed with a correlation matrix, to obtain a first set of irrelevant feature data; similarly, the features extracted from the channel, merchant, product-information and user-IP category are visualized, or analyzed with a correlation matrix, to obtain a second set of irrelevant feature data. The first set and the second set of irrelevant feature data are then correlated with each other: for example, if feature A in the first set and feature B in the second set are associated with each other, A and B are eliminated, and the irrelevant feature data that remains unassociated is retained as the final irrelevant feature data.
  • The first set and the second set of irrelevant feature data are already the outliers within their respective types of data and may be fraudulent data; correlation analysis is then performed on these candidates, and where a correlation exists the data is more likely to be normal, whereas the data that remains unrelated is more likely to be fraudulent.
  • Making the final irrelevant feature data into a point set S for fraud identification can improve the accuracy of fraud identification.
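  • A sketch of this cross-group screening, under the assumption that both groups of candidate features are columns over the same records (the 0.5 threshold is illustrative):

```python
# Drop any candidate feature from group A or group B that correlates with a
# candidate in the other group; what survives is the final irrelevant feature set.
import pandas as pd

def final_irrelevant(group_a: pd.DataFrame, group_b: pd.DataFrame, thr: float = 0.5):
    keep_a, keep_b = set(group_a.columns), set(group_b.columns)
    for a in group_a.columns:
        for b in group_b.columns:
            if abs(group_a[a].corr(group_b[b])) >= thr:
                keep_a.discard(a)
                keep_b.discard(b)
    return sorted(keep_a), sorted(keep_b)
```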
  • In a further embodiment, the step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as the irrelevant feature data includes: first visualizing the feature data of each type of data, selecting the discrete points in each type of feature data, and looking up the feature data corresponding to each discrete point; and then using correlation matrix analysis to determine whether the feature data corresponding to the discrete points are associated, and recording the unassociated feature data as the irrelevant feature data. That is, the possibly irrelevant feature data is first found through the visualization processing and is then processed again by the matrix analysis method to obtain the final irrelevant feature data, so as to improve the accuracy of the subsequent identification of fraud data.
  • In an embodiment, after the step S4, the method further includes: S5. Determining a fraud level of the fraud data according to a preset rule; and S6. Taking corresponding punitive measures according to the fraud level.
  • The Voronoi algorithm outputs the first n points with the smallest anomaly factors; the earlier a point is output, the higher the probability that its corresponding data is fraudulent, so the fraud level of the fraud data is determined according to the output order.
  • the above punitive measures generally include alarms, fines, and banned accounts.
  • The relevant data of an enterprise is extracted and then analyzed by the above method. If no fraud data is found, the enterprise is considered to be a reputable enterprise; if fraud data is present, the output fraud data is evaluated, and the more fraud data is output, the lower the enterprise's credibility.
  • The corresponding original data can be traced back from the fraud data, and the fraudulent behavior of the enterprise, such as the fraud amount and the fraud method, can then be analyzed. According to the fraud amount and/or fraud method, it is decided whether to raise an alarm, impose a fine, or ban the account (a sketch follows below).
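  • An illustrative sketch only: mapping the output order of the anomaly points to a fraud level and a punitive action. The thresholds and the level-to-action mapping are assumptions, not values given in the application.

```python
# Earlier output rank (smaller anomaly factor) = higher fraud risk.
def fraud_level(rank: int) -> str:
    return "high" if rank < 3 else "medium" if rank < 10 else "low"

ACTIONS = {"high": "ban account", "medium": "fine", "low": "alarm"}

def punitive_actions(outlier_ranks: dict) -> dict:
    """outlier_ranks maps an account (or record id) to its output rank."""
    return {acct: ACTIONS[fraud_level(r)] for acct, r in outlier_ranks.items()}

print(punitive_actions({"Y-001": 0, "Y-042": 7, "Y-113": 25}))
```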
  • For example, enterprise A needs to visit enterprise B for a business inspection and to sign a cooperation contract.
  • Through the above fraud-data identification method, enterprise A obtains the fraud data of enterprise B over a specified time period on the blockchain. If there is no fraud data, a closer mode of cooperation can be selected; if there is fraud data but only a little, for example a single item of fraud data within five years, a relatively close mode of cooperation can still be chosen; if there is a larger amount of fraud data, enterprise A needs to reconsider whether to establish a cooperative relationship with enterprise B at all.
  • The data fraud identification method in the embodiment of the present application solves for the first time the problem of identifying fraudulent data on an enterprise blockchain.
  • The Voronoi algorithm can single out the data that may be fraudulent, so that an enterprise can assess whether the people or enterprises it does business with may have engaged in fraud, choose an appropriate degree of cooperation accordingly, and thereby reduce the risk in enterprise-to-enterprise or individual-to-enterprise cooperation.
  • an embodiment of the present application further provides a data fraud identification apparatus for data fraud identification on a blockchain, where the apparatus includes:
  • the obtaining unit 10 is configured to acquire data related to the designated enterprise on the blockchain;
  • the feature extraction unit 20 is configured to perform feature extraction on the acquired data to obtain a plurality of feature data.
  • the uncorrelated analysis unit 30 is configured to extract, as the unrelated feature data, feature data that is not related to other feature data in the plurality of feature data;
  • the abnormality identifying unit 40 is configured to perform abnormal value identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
  • the above blockchain is a decentralized, trust-free new data architecture, which is jointly owned, managed, and supervised by all nodes in the network, and does not accept single-party control.
  • a blockchain is a distributed ledger database that manages a growing number of transaction records that are organized into blocks and protected against tampering.
  • A distributed ledger refers to a digital record of ownership that differs from traditional database technology in that no central administrator or central data store is required; the ledger can be replicated across the nodes of a peer-to-peer network, and each transaction is signed with a private key.
  • a consensus mechanism is a mechanism for the application of blockchain or distributed ledger technology that does not rely on a central authority to identify and verify a value or transaction.
  • the consensus mechanism is the basis for all blockchain and distributed ledger applications.
  • the above designated enterprise refers to the enterprise to be queried, that is, the enterprise to be queried whether there is fraudulent data on the blockchain.
  • the above related data is generally all data related to the designated enterprise in the blockchain, such as account, amount, date, time, currency, channel, merchant, product information, user IP, device, etc. related to the designated enterprise.
  • the specific method for obtaining the specified data includes inputting a keyword such as a company name and a business scope, and then performing a search on the blockchain to obtain all data related to the search term.
  • The feature extraction unit 20 is the unit that performs feature extraction. In one embodiment, the specific process includes: integrating the data and normalizing it into a data set; formatting the data in the data set and cleaning the sampled data; and then transforming the sampled data to obtain the required feature data.
  • In a specific embodiment, the feature extraction unit 20 uses the ReliefF algorithm, which was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (the Relief algorithm is a feature weighting algorithm that assigns different weights to features according to the correlation between each feature and the class; features whose weights fall below a certain threshold are removed). Compared with the Relief algorithm, ReliefF can handle multi-class problems.
  • The ReliefF algorithm is used to process regression problems where the target attribute is a continuous value. When handling multi-class problems, the ReliefF algorithm randomly draws a sample R from the training set, finds R's k nearest-neighbor samples of the same class (near hits) and, for each class different from R's, k nearest-neighbor samples (near misses), and then updates the weight of each feature.
  • the specific process is described in the foregoing method embodiment, and details are not described herein.
  • feature extraction unit 20 performs feature extraction using the Relief algorithm described above.
  • The Relief algorithm randomly selects a sample R from the training set D, then searches for the nearest-neighbor sample H among the samples of the same class as R (called the near hit) and the nearest-neighbor sample M among the samples of a different class (called the near miss).
  • The weight of each feature is updated according to the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature is helpful for distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is increased; conversely, if the distance between R and the near hit is greater than the distance between R and the near miss, the feature has a negative effect on distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is decreased.
  • The above process is repeated m times, and finally the average weight of each feature is obtained. The greater the weight of a feature, the stronger its ability to discriminate between classes; conversely, a smaller weight indicates a weaker discriminative ability.
  • the running time of the Relief algorithm increases linearly with the sampling number m of samples and the number of original features N, so the operating efficiency is very high.
  • The irrelevant analysis unit 30 is the unit that finds, among the feature data, the irrelevant feature data that is not correlated with the other feature data. Because the irrelevant feature data is not correlated with the other feature data, its corresponding original data may be fraudulent data.
  • The Voronoi diagram described above is also called a Thiessen polygon or Dirichlet tessellation; it consists of a set of contiguous polygons formed by the perpendicular bisectors of the straight lines connecting pairs of neighboring points. Given N distinct points in the plane, the plane is partitioned according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region.
  • the outliers can be further identified from the above irrelevant feature data by the Voronoi algorithm, and the identified outliers are considered fraud data.
  • the above Voronoi algorithm has lower complexity and faster calculation speed.
  • In a specific embodiment, the obtaining unit 10 extracts all data related to enterprise Y from the blockchain, such as the account of enterprise Y, the account login times, the number of transactions, the transaction amounts, the channel, the merchant, the product information, the user IP, and the like. The feature extraction unit 20 then performs feature extraction on the obtained data, and the irrelevant analysis unit 30 performs correlation analysis on the extracted feature data; the feature data that is not correlated with other feature data is the irrelevant feature data, and the original data corresponding to the irrelevant feature data may be fraudulent data.
  • The abnormality identifying unit 40 performs outlier identification on the irrelevant feature data by the Voronoi algorithm; according to the rules of the Voronoi algorithm, it computes which irrelevant feature data differ from the other irrelevant feature data.
  • The Voronoi algorithm finds the outliers among the irrelevant feature data and then outputs the outliers in order, where the raw data corresponding to the first outlier output is the most likely to be fraudulent, with the likelihood decreasing in turn.
  • In other embodiments, the order may instead be configured so that the raw data corresponding to the first outlier output is the least likely to be fraudulent, with the likelihood increasing in turn.
  • All of the outliers may be output, or only a specified number of outliers whose corresponding raw data is most likely to be fraudulent may be output.
  • the abnormality identifying unit 40 includes:
  • the graphics module 41 is configured to generate the above-mentioned irrelevant feature data into a Voronoi diagram of the point set S; wherein each irrelevant feature data is regarded as a point, thereby generating a point set S.
  • the calculation module 42 is configured to calculate a V-abnormality factor of each point in the point set S, and find a V-adjacent point of each point.
  • The specific execution process of the calculation module 42 includes: for a point p_i in the point set S, determining the adjacent points of its Voronoi polygon V(p_i), calculating the average distance from p_i to its adjacent points, and using the reciprocal of this average distance to measure the degree of abnormality of p_i. For any point p of the point set S, a neighboring point of p determined by an edge of V(p) is called a V-adjacent point of p, and V(p) also denotes the set of all V-adjacent points of point p.
  • The reciprocal of the average distance from point p to all of its V-adjacent points is called the V-anomaly factor of point p, denoted Vd(p):

    Vd(p) = |V(p)| / \sum_{q \in V(p)} d(p, q)

  • where |V(p)| is the number of V-adjacent points of p and d(p, q) is the distance between p and q. Vd(p) reflects the distribution density of points around point p.
  • Arrangement module 43 for arranging the V-anomaly factors of each point from small to large;
  • The output module 44 is configured to output the V-anomaly factor of each point and the first n points with the smallest anomaly factors; the data corresponding to these n points is judged to be at the highest risk of being fraudulent. The sparser the distribution of points around a point p, the smaller its anomaly factor and the lower its correlation with the other data, so the original data of the irrelevant feature data corresponding to the smallest V-anomaly factors is the most likely to be fraudulent.
  • n is a preset value and is an integer greater than zero.
  • the uncorrelated analysis unit 30 includes:
  • the visual analysis module 31 is configured to visualize the plurality of feature data, and record the feature data corresponding to the discrete points in the visualization as the irrelevant feature data.
  • The visualization processing described above refers to converting the feature data into a graphic or an image on a screen by using computer graphics and image processing techniques. Because the feature data is visualized, a person can visually identify the discrete points on the graphic or image and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
  • the visual analysis module 31 includes a scatter plot creation sub-module 311 for creating the plurality of feature data into a scatter plot.
  • the above scatter diagram refers to the distribution of data points on the Cartesian coordinate plane in regression analysis; it is usually used to compare aggregated data across categories. The more data you have in a scatter plot, the better the comparison will be.
  • the feature data is generally a matrix.
  • a scatter plot matrix can be used to simultaneously draw a scatter plot between the variables, so that the main correlation between multiple variables can be quickly found.
  • the foregoing irrelevant analysis unit 30 includes:
  • the correlation matrix analysis module 32 is configured to perform correlation matrix analysis on the plurality of feature data, and extract the irrelevant feature data that is not related to other feature data.
  • the above correlation matrix is also called a correlation coefficient matrix, which is composed of correlation coefficients between columns of the matrix. That is to say, the elements of the i-th row and the j-th column of the correlation matrix are the correlation coefficients of the i-th column and the j-th column of the original matrix.
  • A covariance matrix is generally used for the analysis. The covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating a positive correlation; if they change in opposite directions, the covariance is negative, indicating a negative correlation; if the two variables are independent of each other, the covariance is 0, indicating that the two variables are unrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
  • the feature extraction unit 20 includes:
  • a classification module 21 configured to classify the acquired data according to a preset requirement
  • the extracting module 22 is configured to perform feature extraction on each type of data separately.
  • The preset requirement is a classification standard; generally, data that may be related are classified into one category. For example, data such as the account, login time, number of transactions, and transaction amount are classified together, because the correlation among these data is strong: a user logs in to the corresponding system through the account, the system records the login time, and it also records the number of transactions and the amount of each transaction or the total transaction amount, so there are associations among these data.
  • the acquired data is first classified, and then the feature extraction is performed on different types of data respectively.
  • The feature extraction may be performed by the Relief algorithm described above or by the ReliefF algorithm described above.
  • Because the data is classified before the features are extracted, the features of each type of data are relatively distinct and easy to extract, which improves the accuracy of the subsequent identification of fraud data.
  • In an embodiment, the above irrelevant analysis unit 30 includes:
  • the classification analysis module 301 is configured to extract irrelevant feature data of the plurality of feature data corresponding to the various types of data;
  • the hybrid analysis module 302 is configured to mix the uncorrelated feature data corresponding to the various types of data, perform correlation analysis, and record the irrelevant feature data without correlation as the final irrelevant feature data.
  • the feature extraction is performed on each type of data described above.
  • For example, the features extracted from the account, login time, transaction-count and transaction-amount category are first visualized, or analyzed with a correlation matrix, to obtain a first set of irrelevant feature data; similarly, the features extracted from the channel, merchant, product-information and user-IP category are visualized, or analyzed with a correlation matrix, to obtain a second set of irrelevant feature data. The first set and the second set of irrelevant feature data are then correlated with each other: for example, if feature A in the first set and feature B in the second set are associated with each other, A and B are eliminated, and the irrelevant feature data that remains unassociated is retained as the final irrelevant feature data.
  • The first set and the second set of irrelevant feature data are already the outliers within their respective types of data and may be fraudulent data; correlation analysis is then performed on these candidates, and where a correlation exists the data is more likely to be normal, whereas the data that remains unrelated is more likely to be fraudulent.
  • Making the final irrelevant feature data into a point set S for fraud identification can improve the accuracy of fraud identification.
  • the above-mentioned irrelevant analysis unit 30 includes:
  • a visualization module 303 configured to perform visual processing on the plurality of feature data
  • the matrix analysis module 304 is configured to extract feature data corresponding to discrete points in the visualization, and perform correlation matrix analysis on the feature data corresponding to the discrete points, and extract non-associated feature data that is not associated in the feature data corresponding to each discrete point, and The non-associated feature data is recorded as the irrelevant feature data.
  • the feature data of each type of data is first visualized, the discrete points in the various feature data are selected, and the feature data corresponding to each discrete point is searched;
  • The matrix analysis method is used to determine whether the feature data corresponding to the discrete points are associated, and the unassociated feature data is recorded as the irrelevant feature data. That is, the possibly irrelevant feature data is first found through the visualization processing and is then processed again by the matrix analysis method to obtain the final irrelevant feature data, so as to improve the accuracy of the subsequent identification of fraud data.
  • the data fraud identification device further includes:
  • a fraud level determining unit 50 configured to determine a fraud level of the fraud data according to a preset rule
  • the punishment unit 60 is configured to make corresponding punishment measures according to the corresponding fraud level.
  • The Voronoi algorithm outputs the first n points with the smallest anomaly factors; the earlier a point is output, the higher the probability that its corresponding data is fraudulent, so the fraud level of the fraud data is determined according to the output order.
  • The punitive measures generally include alarms, fines, and account bans.
  • The relevant data of an enterprise is extracted and then analyzed by the above method. If no fraud data is found, the enterprise is considered to be a reputable enterprise; if fraud data is present, the output fraud data is evaluated, and the more fraud data is output, the lower the enterprise's credibility.
  • The corresponding original data can be traced back from the fraud data, and the fraudulent behavior of the enterprise, such as the fraud amount and the fraud method, can then be analyzed. According to the fraud amount and/or fraud method, it is decided whether to raise an alarm, impose a fine, or ban the account.
  • For example, enterprise A needs to visit enterprise B for a business inspection and to sign a cooperation contract.
  • Through the above fraud-data identification method, enterprise A obtains the fraud data of enterprise B over a specified time period on the blockchain. If there is no fraud data, a closer mode of cooperation can be selected; if there is fraud data but only a little, for example a single item of fraud data within five years, a relatively close mode of cooperation can still be chosen; if there is a larger amount of fraud data, enterprise A needs to reconsider whether to establish a cooperative relationship with enterprise B at all.
  • The data fraud identification device in the embodiment of the present application solves for the first time the problem of identifying fraudulent data on an enterprise blockchain.
  • The Voronoi algorithm can single out the data that may be fraudulent, so that an enterprise can assess whether the people or enterprises it does business with may have engaged in fraud, choose an appropriate degree of cooperation accordingly, and thereby reduce the risk in enterprise-to-enterprise or individual-to-enterprise cooperation.
  • The computer device may be a server, and its internal structure may be as shown in FIG. 11.
  • The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • The internal memory provides an environment for the operation of the operating system and the computer readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store data such as the Voronoi algorithm model.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by a processor to implement the processes of the various method embodiments described above.
  • An embodiment of the present application further provides a non-volatile computer-readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the processes of the foregoing method embodiments.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application discloses a data fraud identification method, apparatus, computer device, and storage medium. The method comprises: acquiring data related to a specified business from a blockchain; performing feature extraction on the acquired data to obtain a plurality of feature data; extracting feature data unrelated to other feature data from the plurality of feature data to serve as unrelated feature data; and performing anomaly detection on the unrelated feature data by means of a Voronoi algorithm to obtain fraudulent data.

Description

Data fraud identification method, apparatus, computer device and storage medium
This application claims priority to the Chinese patent application No. 2018103447388, filed with the Chinese Patent Office on April 17, 2018 and entitled "Data fraud identification method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data fraud identification, and in particular to a data fraud identification method, apparatus, computer device and storage medium.
Background
Blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and is not subject to control by any single party. Data on a blockchain cannot be falsified after the fact, but fraudulent data created by malicious fake orders ("order brushing") can still exist. How to determine whether the data on a blockchain was generated by normal transactions or was "brushed" is an urgent problem to be solved.
Technical Problem
The main purpose of the present application is to provide a data fraud identification method, apparatus, computer device and storage medium that can effectively identify fraudulent data on a blockchain.
Technical Solution
The present application provides a data fraud identification method for data fraud identification on a blockchain, the method comprising:
acquiring data related to a specified enterprise on a blockchain;
performing feature extraction on the acquired data to obtain a plurality of feature data;
extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data; and
performing outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
The present application also provides a data fraud identification apparatus for data fraud identification on a blockchain, the apparatus comprising:
an obtaining unit, configured to acquire data related to a specified enterprise on a blockchain;
a feature extraction unit, configured to perform feature extraction on the acquired data to obtain a plurality of feature data;
an irrelevant analysis unit, configured to extract, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data; and
an abnormality identifying unit, configured to perform outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
The present application further provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, and the processor, when executing the computer readable instructions, implementing the steps of any of the methods described above.
The present application also provides a non-volatile computer-readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the steps of any of the methods described above.
Beneficial Effects
The data fraud identification method, apparatus, computer device and storage medium of the present application address for the first time the problem of identifying fraudulent data on an enterprise blockchain. Using the Voronoi algorithm, data that may be fraudulent can be singled out, so that an enterprise can assess whether the people or companies it does business with may be engaging in fraud and then choose an appropriate degree of cooperation.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application;
FIG. 2 is a Voronoi diagram according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application;
FIG. 4 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram showing the structure of an abnormality identifying unit according to an embodiment of the present application;
FIG. 6 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application;
FIG. 7 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to another embodiment of the present application;
FIG. 8 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application;
FIG. 9 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application;
FIG. 10 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application;
FIG. 11 is a schematic block diagram showing the structure of a computer device according to an embodiment of the present application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, an embodiment of the present application provides a data fraud identification method for data fraud identification on a blockchain, the method including the steps of: S1. Acquiring data related to a specified enterprise on a blockchain;
S2. Performing feature extraction on the acquired data to obtain a plurality of feature data;
S3. Extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data;
S4. Performing outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
As described in step S1 above, the blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and is not subject to control by any single party. A blockchain is a distributed ledger database that manages a continuously growing set of transaction records that are organized into blocks and protected against tampering. A distributed ledger refers to a digital record of ownership that differs from traditional database technology in that no central administrator or central data store is required; the ledger can be replicated across the nodes of a peer-to-peer network, and each transaction is signed with a private key. A consensus mechanism is a mechanism by which blockchain or distributed ledger applications identify and verify a value or transaction without relying on a central authority, and it is the basis of all blockchain and distributed ledger applications. The specified enterprise refers to the enterprise to be queried, that is, the enterprise for which it is to be determined whether fraudulent data exists on the blockchain. The related data is generally all data related to the specified enterprise on the blockchain, such as the account, amount, date, time, currency, channel, merchant, product information, user IP, and device related to the specified enterprise. A specific way of obtaining the specified data includes entering keywords such as the enterprise name and business scope and then searching the blockchain to obtain all data related to the search terms.
As described in step S2 above, this is the feature extraction process. In one embodiment, the specific process includes: integrating the data and normalizing it into a data set; formatting the data in the data set and cleaning the sampled data; and then transforming the sampled data to obtain the required feature data.
In a specific embodiment, the feature extraction may use the ReliefF algorithm, which was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (the Relief algorithm is a feature weighting algorithm that assigns different weights to features according to the correlation between each feature and the class; features whose weights fall below a certain threshold are removed). Compared with the Relief algorithm, ReliefF can handle multi-class problems. The ReliefF algorithm is used to process regression problems where the target attribute is a continuous value. When handling multi-class problems, the ReliefF algorithm randomly draws a sample R from the training set, finds R's k nearest-neighbor samples of the same class (near hits H_j) and, for each class different from R's, k nearest-neighbor samples (near misses M_j(C)), and then updates the weight of each feature A as follows:
W(A) = W(A) - \sum_{j=1}^{k} diff(A, R, H_j)/(m \cdot k) + \sum_{C \neq class(R)} [ p(C)/(1 - p(class(R))) ] \cdot \sum_{j=1}^{k} diff(A, R, M_j(C))/(m \cdot k)
In the above formula, diff(A, R_1, R_2) represents the difference between sample R_1 and sample R_2 on feature A, M_j(C) represents the j-th nearest-neighbor sample in class C, and for a numeric feature diff is computed as follows:
diff(A, R_1, R_2) = |R_1[A] - R_2[A]| / (max(A) - min(A))
In another specific embodiment, the Relief algorithm described above is used for feature extraction. The Relief algorithm randomly selects a sample R from the training set D, then searches for the nearest-neighbor sample H among the samples of the same class as R (called the near hit) and the nearest-neighbor sample M among the samples of a different class (called the near miss), and then updates the weight of each feature according to the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature is helpful for distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is increased; conversely, if the distance between R and the near hit is greater than the distance between R and the near miss, the feature has a negative effect on distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is decreased. The above process is repeated m times, and finally the average weight of each feature is obtained. The greater the weight of a feature, the stronger its ability to discriminate between classes; conversely, a smaller weight indicates a weaker discriminative ability. The running time of the Relief algorithm increases linearly with the number of sampling iterations m and the number of original features N, so it is very efficient.
As described in step S3 above, the goal is to find, among the feature data, the irrelevant feature data that is not correlated with the other data. Because the irrelevant feature data is uncorrelated with the other feature data, the original data it corresponds to may be fraud data.
As described in step S4 above, a Voronoi diagram (also called a Thiessen polygon or Dirichlet tessellation) consists of a set of contiguous polygons formed by the perpendicular bisectors of the lines connecting neighboring points. N distinct points in the plane partition the plane according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region. With the Voronoi algorithm, outliers can be further identified from the irrelevant feature data, and the identified outliers are treated as fraud data. The Voronoi algorithm has low complexity and is fast to compute.
In a specific embodiment, all data related to enterprise Y is extracted from the blockchain, such as enterprise Y's accounts, account login times, number of transactions, transaction amounts, channels, merchants, product information, user IPs, and so on. Feature extraction is then performed on the obtained data, and a correlation analysis is carried out on the extracted feature data; the feature data that is not correlated with the other feature data is the irrelevant feature data, and the original data corresponding to the irrelevant feature data may be fraud data. To determine how likely it is that the original data corresponding to the irrelevant feature data is fraud data, outlier identification is performed on the irrelevant feature data with the Voronoi algorithm. The Voronoi algorithm finds the outliers within the irrelevant feature data (according to the rules of the Voronoi algorithm, it computes the irrelevant feature data that differs from the other irrelevant feature data) and then outputs the outliers in ranked order, where the original data corresponding to the first outlier output is the most likely to be fraud data, with the likelihood decreasing in turn; in other embodiments, the ordering may instead be configured so that the original data corresponding to the first outlier output is the least likely to be fraud data, with the likelihood increasing in turn. In this embodiment, all outliers may be output, or only a specified number of outliers whose corresponding original data is most likely to be fraud data may be output.
Referring to FIG. 2, in this embodiment, step S4 of performing outlier identification on the irrelevant feature data with the Voronoi algorithm to obtain fraud data specifically includes:
a. constructing a Voronoi diagram of the point set S from the irrelevant feature data;
here, each item of irrelevant feature data is treated as one point, so that the point set S is generated;
b. calculating the V-anomaly factor of each point in the point set S and finding the V-neighbors of each point, specifically: b1. for a point pi in the point set S, determining its neighboring points from its Voronoi polygon V(pi), calculating the average distance from pi to each of its neighboring points, and using the reciprocal of this average distance to measure how anomalous pi is;
b2. for any point p in the point set S, the neighboring points of p determined by the edges of V(p) are called the V-neighbors of p, and the set of all V-neighbors of p is denoted V(p);
b3. the reciprocal of the average distance from all the V-neighbors of p to p is called the V-anomaly factor of p, denoted Vd(p):
Vd(p) = \frac{|V(p)|}{\sum_{q \in V(p)} d(p, q)}
where |V(p)| is the number of V-neighbors of p;
Vd(p) reflects the density of the points around p: the sparser the distribution of points around p, the larger the average distance from p to its V-neighbors, and hence the smaller Vd(p); in other words, a smaller V-anomaly factor indicates a more isolated and more anomalous point.
c. sorting the points by their V-anomaly factors in ascending order;
d. outputting the V-anomaly factor of each point together with the first n points having the smallest anomaly factors; the data corresponding to these first n points is judged to be the data with the highest risk of being fraud data. The sparser the distribution of points around a point p, the smaller its anomaly factor, which indicates a lower correlation with the other data, so the probability that its corresponding irrelevant feature data is an outlier is higher; therefore, the original data of the irrelevant feature data corresponding to the smallest V-anomaly factors is the most likely to be fraud data. Here, n is a preset value and is an integer greater than zero.
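A compact Python sketch of steps a through d is given below, using scipy's Voronoi construction; treating each item of irrelevant feature data as a low-dimensional numeric point (for example a 2-D projection), as well as the function and variable names, are assumptions made for illustration only:

```python
import numpy as np
from scipy.spatial import Voronoi

def v_anomaly_factors(points, n=5):
    """Rank the points of set S by V-anomaly factor Vd(p); the smallest factors are output first."""
    points = np.asarray(points, dtype=float)
    vor = Voronoi(points)
    neighbors = {i: set() for i in range(len(points))}
    # Two points whose Voronoi cells share a ridge are V-neighbors of each other.
    for i, j in vor.ridge_points:
        neighbors[i].add(j)
        neighbors[j].add(i)
    vd = np.empty(len(points))
    for i, nbrs in neighbors.items():
        dists = [np.linalg.norm(points[i] - points[j]) for j in nbrs]
        # Vd(p) = |V(p)| / sum of distances, i.e. the reciprocal of the mean neighbor distance.
        vd[i] = len(dists) / np.sum(dists) if dists else 0.0
    order = np.argsort(vd)            # ascending: smallest anomaly factor first
    return vd, order[:n]              # the first n points carry the highest fraud risk
```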
In this embodiment, step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S31. visualizing the plurality of feature data, and recording the feature data corresponding to discrete points in the visualization as the irrelevant feature data.
As described in S31 above, the visualization processing refers to converting the feature data into graphics or images displayed on a screen by means of computer graphics and image processing techniques. Because the feature data is visualized, a person can directly identify the discrete points in the graphic or image by eye and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
The step of visualizing the plurality of feature data includes: plotting the plurality of feature data as a scatter plot. In regression analysis, a scatter plot (scatter diagram) is the distribution of data points on a Cartesian coordinate plane; it is usually used to compare aggregated data across categories. The more data a scatter plot contains, the better the comparison. In this embodiment, the feature data is generally a matrix, in which case a scatter plot matrix can be used to draw the pairwise scatter plots of the variables simultaneously, so that the main correlations among multiple variables can be found quickly.
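As an illustration of the scatter plot matrix idea, the following Python sketch draws every pairwise scatter plot of a hypothetical feature matrix so that isolated (discrete) points can be spotted by eye; the column names and the random data are assumptions, not data from this application:

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Hypothetical feature matrix: rows are records, columns are extracted features.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 4)),
                        columns=["login_time", "tx_count", "tx_amount", "channel_score"])

# One figure containing all pairwise scatter plots; points far from the main
# cloud in several panels are candidates for irrelevant feature data.
scatter_matrix(features, figsize=(8, 8), diagonal="hist")
plt.show()
```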
In another embodiment, step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S32. performing correlation matrix analysis on the plurality of feature data, and extracting the irrelevant feature data that is not correlated with the other feature data.
The correlation matrix, also called the correlation coefficient matrix, is made up of the correlation coefficients between the columns of a matrix; that is, the element in row i, column j of the correlation matrix is the correlation coefficient between column i and column j of the original matrix. In this embodiment, a covariance matrix is generally used for the analysis. Covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating that they are positively correlated; if they tend to change in opposite directions, the covariance is negative, indicating that they are negatively correlated; and if the two variables are independent of each other, the covariance is 0, indicating that they are uncorrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
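A minimal sketch of the correlation matrix analysis described above follows; the numeric threshold and the rule that a feature is flagged when its absolute correlation with every other feature stays below that threshold are assumptions introduced here for illustration:

```python
import numpy as np

def uncorrelated_features(X, names, threshold=0.1):
    """Flag columns whose absolute correlation with all other columns stays below threshold."""
    corr = np.corrcoef(X, rowvar=False)          # correlation coefficient matrix of the columns
    flagged = []
    for i, name in enumerate(names):
        others = np.delete(np.abs(corr[i]), i)   # drop the trivial self-correlation of 1
        if np.all(others < threshold):
            flagged.append(name)                 # candidate irrelevant feature data
    return corr, flagged
```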
In yet another specific embodiment, step S2 of performing feature extraction on the acquired data to obtain the feature data includes:
S21. classifying the acquired data according to preset requirements;
S22. performing feature extraction on each class of data separately.
As described in steps S21 and S22 above, the preset requirement is the classification criterion; in general, data that may be correlated is grouped into one class. For example, data such as account, login time, number of transactions, and transaction amount are grouped together, because the correlation among these items is strong; the correlation is strong because the account is used to log in to the corresponding system, the system records the login time, and it also records the number of transactions and the amount of each transaction or the overall transaction amount, so the items are associated with one another. In this embodiment, the acquired data is first classified, and feature extraction is then performed on each class of data separately; the specific feature extraction method may be, for example, the Relief algorithm described above or the ReliefF algorithm described above. Extracting feature data class by class has two benefits: first, the features of each class of feature data are relatively distinct, which makes extraction easier; second, it can improve the accuracy of the later identification of fraud data. For example, the subsequent step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S301. extracting the irrelevant feature data from the plurality of feature data corresponding to each class of data;
S302. mixing the irrelevant feature data corresponding to the classes of data, performing a correlation analysis, and recording the irrelevant feature data that has no correlation as the final irrelevant feature data.
As described in steps S301 and S302 above, this is carried out on the basis of the class-by-class feature extraction described above. For example, visualization processing or correlation matrix analysis is first performed on the class containing account, login time, number of transactions, and transaction amount to obtain a first group of irrelevant feature data; similarly, the features extracted from the class containing channel, merchant, product information, user IP, and so on are visualized or analyzed with a correlation matrix to obtain a second group of irrelevant feature data. A correlation analysis is then performed between the first group and the second group of irrelevant feature data; for example, if feature A in the first group and feature B in the second group are correlated with each other, A and B are removed, and the mutually uncorrelated irrelevant feature data is kept as the final irrelevant feature data. Because the first and second groups of irrelevant feature data are already the outliers within their respective classes of data and may therefore be fraud data, a correlation analysis is performed among these possibly fraudulent items: items that are correlated have a higher probability of being normal data, while items that remain uncorrelated have a higher probability of being fraud data. Building the point set S from the final irrelevant feature data for fraud identification can improve the accuracy of fraud identification.
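A sketch, under assumptions, of steps S301 and S302: per-class candidate irrelevant features are mixed, and candidates that turn out to be correlated across classes are dropped as probably normal. The threshold, the data layout (each group must contain the same records as rows), and all names are hypothetical:

```python
import numpy as np

def final_irrelevant(groups, names, threshold=0.3):
    """Mix per-class outlier candidates and keep only those uncorrelated across classes.

    groups: list of 2D arrays, one per data class, columns = candidate irrelevant features;
    names:  matching lists of column names."""
    cols = np.hstack(groups)
    labels = [n for grp in names for n in grp]
    cls = np.concatenate([np.full(g.shape[1], i) for i, g in enumerate(groups)])
    corr = np.corrcoef(cols, rowvar=False)
    drop = set()
    for a in range(cols.shape[1]):
        for b in range(a + 1, cols.shape[1]):
            # A cross-class correlation suggests both candidates are probably normal data.
            if cls[a] != cls[b] and abs(corr[a, b]) >= threshold:
                drop.update((a, b))
    return [labels[i] for i in range(cols.shape[1]) if i not in drop]
```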
In another embodiment, step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S303. visualizing the plurality of feature data;
S304. extracting the feature data corresponding to the discrete points in the visualization, performing correlation matrix analysis on the feature data corresponding to the discrete points, extracting the unassociated feature data that has no association among the feature data corresponding to the discrete points, and recording the unassociated feature data as the irrelevant feature data.
As described in steps S303 and S304 above, the feature data of each class of data is first visualized, the discrete points among the various kinds of feature data are selected, and the feature data corresponding to each discrete point is found; correlation matrix analysis is then used to determine whether the feature data corresponding to the discrete points is associated, and the unassociated feature data is recorded as the irrelevant feature data. In other words, the possibly irrelevant feature data is first found through the visualization process and is then processed once more by correlation matrix analysis to obtain the final irrelevant feature data, thereby improving the accuracy of the subsequent identification of fraud data.
Referring to FIG. 3, in this embodiment, after step S4 of performing outlier identification on the irrelevant feature data with the Voronoi algorithm to obtain fraud data, the method includes:
S5. determining the fraud level of the fraud data according to preset rules;
S6. taking corresponding penalty measures according to the corresponding fraud level.
As described in steps S5 and S6 above, the Voronoi algorithm outputs the first n points with the smallest anomaly factors, and the data corresponding to the foremost point has the highest probability of being fraud data, so the fraud level of the fraud data is determined according to the order of the output. The penalty measures generally include raising an alarm, imposing a fine, suspending the account, and the like. For example, the relevant data of an enterprise is extracted and then analyzed with the method described above; if no fraud data exists, the enterprise is considered reputable, while if fraud data exists, the number of items of fraud data output is assessed: the more fraud data output, the lower the enterprise's credibility. The data output by the Voronoi algorithm can also be traced back to the corresponding original data so as to analyze the enterprise's fraudulent behavior, for example the fraud amount and the type of fraud, and whether to raise an alarm, suspend the account, and so on is then decided according to the fraud amount and/or the fraudulent behavior.
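One possible, purely illustrative mapping from the ranked Voronoi output to a fraud level and a penalty is sketched below; the level bands and the penalty wording are assumptions, while the application itself only names alarms, fines, and account suspension as examples of penalty measures:

```python
def fraud_level(rank, n):
    """Map the output order (rank 0 = first output = highest risk) to an assumed fraud level."""
    position = rank / max(n - 1, 1)
    if position < 0.2:
        return "high"
    if position < 0.6:
        return "medium"
    return "low"

def penalty(level):
    """Hypothetical penalty rules keyed on the fraud level."""
    return {"high": "suspend account and raise an alarm",
            "medium": "impose a fine and review the account",
            "low": "flag the account for monitoring"}[level]

# Example: the first of ten output outliers gets the highest fraud level.
print(fraud_level(0, 10), penalty(fraud_level(0, 10)))
```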
In one specific implementation, A needs to visit enterprise B for a business inspection and to sign a cooperation contract. Before going to enterprise B, A first uses the fraud data identification method described above to obtain enterprise B's fraud data on the blockchain within a specified time period. If there is no fraud data, A can choose a closer form of cooperation; if fraud data exists but is scarce, for example a single item of fraud data within five years, A can choose an ordinary degree of cooperation; and if there is a large amount of fraud data, A needs to consider whether to establish a cooperative relationship with enterprise B at all.
The data fraud identification method of the embodiments of the present application is the first to address the identification of fraudulent data on an enterprise blockchain. By using the Voronoi algorithm, it can single out the data that may be fraud data, so that an enterprise can learn whether the other people or enterprises it does business with may be behaving fraudulently, and can then choose an appropriate closeness of cooperation, reducing the risk of being defrauded in cooperation between enterprises or between individuals and enterprises.
Referring to FIG. 4, an embodiment of the present application further provides a data fraud identification apparatus for data fraud identification on a blockchain, the apparatus including:
an obtaining unit 10, configured to acquire data related to a specified enterprise on a blockchain;
a feature extraction unit 20, configured to perform feature extraction on the acquired data to obtain a plurality of feature data;
an uncorrelated analysis unit 30, configured to extract, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data;
an abnormality identification unit 40, configured to perform outlier identification on the irrelevant feature data by means of the Voronoi algorithm to obtain fraud data.
In the obtaining unit 10, the blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and does not accept control by any single party. A blockchain is a distributed ledger database that manages continuously growing transaction records that are organized into blocks in order and protected against tampering. A distributed ledger is a record of digital ownership that differs from traditional database technology (no central administrator or central data store is required); such a ledger can be replicated among different nodes of a peer-to-peer network, and each transaction is signed with a private key. A consensus mechanism is a mechanism used by blockchain or distributed ledger applications to identify and verify a value or transaction without relying on a central authority, and it is the basis of all blockchain and distributed ledger applications. The specified enterprise refers to the enterprise to be queried, that is, the enterprise for which the existence of fraud data on the blockchain is to be checked. The related data is generally all data on the blockchain related to the specified enterprise, such as the accounts, amounts, dates, times, currencies, channels, merchants, product information, user IPs, devices, and so on associated with that enterprise. A specific method of acquiring the specified data includes: entering keywords such as the enterprise name and the enterprise's scope of operation, and then searching the blockchain to obtain all data related to the search terms.
The feature extraction unit 20 is the unit that completes feature extraction. In one embodiment, its specific process includes: integrating the data and normalizing it into a single data set; formatting the data in the data set and cleaning the sampled data; and then converting the sampled data to obtain the required feature data.
In a specific embodiment, the feature extraction of the feature extraction unit 20 uses the ReliefF algorithm. ReliefF was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (Relief is a feature weighting algorithm that assigns each feature a weight according to its correlation with the class; features whose weights fall below a certain threshold are removed), and compared with Relief it can handle multi-class problems. The ReliefF algorithm is also used to handle regression problems in which the target attribute is a continuous value. When handling a multi-class problem, ReliefF repeatedly draws a random sample R from the training set, finds the k nearest neighbor samples of R within the same class (near Hits), finds k nearest neighbor samples in each class different from that of R (near Misses), and then updates the weight of each feature; the specific process has already been described in the method embodiments above and is not repeated here.
In another specific embodiment, the feature extraction unit 20 performs feature extraction with the Relief algorithm described above. Relief randomly selects a sample R from the training set D, then finds the nearest neighbor sample H among samples of the same class (called the Near Hit) and the nearest neighbor sample M among samples of a different class (called the Near Miss), and updates the weight of each feature according to the following rule: if the distance between R and the Near Hit on a feature is smaller than the distance between R and the Near Miss on that feature, the feature helps distinguish nearest neighbors of the same and different classes, and its weight is increased; conversely, if the distance between R and the Near Hit on a feature is larger than the distance between R and the Near Miss, the feature has a negative effect on that distinction, and its weight is decreased. The above process is repeated m times, and the average weight of each feature is finally obtained. The larger a feature's weight, the stronger its discriminative ability; conversely, the smaller the weight, the weaker that ability. The running time of Relief increases linearly with the number of sampling iterations m and the number of original features N, so it is very efficient.
The uncorrelated analysis unit 30 is the unit that finds, among the feature data, the irrelevant feature data that is not correlated with the other data. Because the irrelevant feature data is uncorrelated with the other feature data, the original data it corresponds to may be fraud data.
In the abnormality identification unit 40, the Voronoi diagram (also called a Thiessen polygon or Dirichlet tessellation) consists of a set of contiguous polygons formed by the perpendicular bisectors of the lines connecting neighboring points. N distinct points in the plane partition the plane according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region. With the Voronoi algorithm, outliers can be further identified from the irrelevant feature data, and the identified outliers are treated as fraud data. The Voronoi algorithm has low complexity and is fast to compute.
In a specific embodiment, the obtaining unit 10 extracts all data related to enterprise Y from the blockchain, such as enterprise Y's accounts, account login times, number of transactions, transaction amounts, channels, merchants, product information, user IPs, and so on. The feature extraction unit 20 then performs feature extraction on the obtained data, after which the uncorrelated analysis unit 30 performs a correlation analysis on the extracted feature data; the feature data that is not correlated with the other feature data is the irrelevant feature data, and the original data corresponding to the irrelevant feature data may be fraud data. To determine how likely it is that the original data corresponding to the irrelevant feature data is fraud data, the abnormality identification unit 40 performs outlier identification on the irrelevant feature data with the Voronoi algorithm (according to the rules of the Voronoi algorithm, it computes the irrelevant feature data that differs from the other irrelevant feature data). The Voronoi algorithm finds the outliers within the irrelevant feature data and then outputs them in ranked order, where the original data corresponding to the first outlier output is the most likely to be fraud data, with the likelihood decreasing in turn; in other embodiments, the ordering may instead be configured so that the original data corresponding to the first outlier output is the least likely to be fraud data, with the likelihood increasing in turn. In this embodiment, all outliers may be output, or only a specified number of outliers whose corresponding original data is most likely to be fraud data may be output.
Referring to FIG. 5 and FIG. 2, in this embodiment, the abnormality identification unit 40 includes:
a diagram module 41, configured to construct a Voronoi diagram of the point set S from the irrelevant feature data, where each item of irrelevant feature data is treated as one point, so that the point set S is generated;
a calculation module 42, configured to calculate the V-anomaly factor of each point in the point set S and find the V-neighbors of each point. The specific execution process of the calculation module 42 includes: for a point pi in the point set S, determining its neighboring points from its Voronoi polygon V(pi), calculating the average distance from pi to each of its neighboring points, and using the reciprocal of this average distance to measure how anomalous pi is; for any point p in the point set S, the neighboring points of p determined by the edges of V(p) are called the V-neighbors of p, and the set of all V-neighbors of p is denoted V(p); the reciprocal of the average distance from all the V-neighbors of p to p is called the V-anomaly factor of p, denoted Vd(p):
Vd(p) = \frac{|V(p)|}{\sum_{q \in V(p)} d(p, q)}
where |V(p)| is the number of V-neighbors of p;
Vd(p) reflects the density of the points around p: the sparser the distribution of points around p, the larger the average distance from p to its V-neighbors, and hence the smaller Vd(p); in other words, a smaller V-anomaly factor indicates a more isolated and more anomalous point.
an arrangement module 43, configured to sort the points by their V-anomaly factors in ascending order;
an output module 44, configured to output the V-anomaly factor of each point together with the first n points having the smallest anomaly factors; the data corresponding to these first n points is judged to be the data with the highest risk of being fraud data. The sparser the distribution of points around a point p, the smaller its anomaly factor, which indicates a lower correlation with the other data, so the probability that its corresponding irrelevant feature data is an outlier is higher; therefore, the original data of the irrelevant feature data corresponding to the smallest V-anomaly factors is the most likely to be fraud data. Here, n is a preset value and is an integer greater than zero.
Referring to FIG. 6, in this embodiment, the uncorrelated analysis unit 30 includes:
a visual analysis module 31, configured to visualize the plurality of feature data and record the feature data corresponding to discrete points in the visualization as the irrelevant feature data.
In the visual analysis module 31, the visualization processing refers to converting the feature data into graphics or images displayed on a screen by means of computer graphics and image processing techniques. Because the feature data is visualized, a person can directly identify the discrete points in the graphic or image by eye and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
The visual analysis module 31 includes a scatter plot creation sub-module 311, configured to plot the plurality of feature data as a scatter plot. In regression analysis, a scatter plot (scatter diagram) is the distribution of data points on a Cartesian coordinate plane; it is usually used to compare aggregated data across categories. The more data a scatter plot contains, the better the comparison. In this embodiment, the feature data is generally a matrix, in which case a scatter plot matrix can be used to draw the pairwise scatter plots of the variables simultaneously, so that the main correlations among multiple variables can be found quickly.
Referring to FIG. 7, in another embodiment, the uncorrelated analysis unit 30 includes:
a correlation matrix analysis module 32, configured to perform correlation matrix analysis on the plurality of feature data and extract the irrelevant feature data that is not correlated with the other feature data.
In the correlation matrix analysis module 32, the correlation matrix, also called the correlation coefficient matrix, is made up of the correlation coefficients between the columns of a matrix; that is, the element in row i, column j of the correlation matrix is the correlation coefficient between column i and column j of the original matrix. In this embodiment, a covariance matrix is generally used for the analysis. Covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating that they are positively correlated; if they tend to change in opposite directions, the covariance is negative, indicating that they are negatively correlated; and if the two variables are independent of each other, the covariance is 0, indicating that they are uncorrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
Referring to FIG. 8, in yet another specific embodiment, the feature extraction unit 20 includes:
a classification module 21, configured to classify the acquired data according to preset requirements;
an extraction module 22, configured to perform feature extraction on each class of data separately.
In the classification module 21 and the extraction module 22, the preset requirement is the classification criterion; in general, data that may be correlated is grouped into one class. For example, data such as account, login time, number of transactions, and transaction amount are grouped together, because the correlation among these items is strong; the correlation is strong because the account is used to log in to the corresponding system, the system records the login time, and it also records the number of transactions and the amount of each transaction or the overall transaction amount, so the items are associated with one another. In this embodiment, the acquired data is first classified, and feature extraction is then performed on each class of data separately; the specific feature extraction method may be, for example, the Relief algorithm described above or the ReliefF algorithm described above. Extracting feature data class by class has two benefits: first, the features of each class of feature data are relatively distinct, which makes extraction easier; second, it can improve the accuracy of the later identification of fraud data. For example, the subsequent uncorrelated analysis unit 30 includes:
a classification analysis module 301, configured to extract the irrelevant feature data from the plurality of feature data corresponding to each class of data;
a mixed analysis module 302, configured to mix the irrelevant feature data corresponding to the classes of data, perform a correlation analysis, and record the irrelevant feature data that has no correlation as the final irrelevant feature data.
The classification analysis module 301 and the mixed analysis module 302 operate on the basis of the class-by-class feature extraction described above. For example, visualization processing or correlation matrix analysis is first performed on the class containing account, login time, number of transactions, and transaction amount to obtain a first group of irrelevant feature data; similarly, the features extracted from the class containing channel, merchant, product information, user IP, and so on are visualized or analyzed with a correlation matrix to obtain a second group of irrelevant feature data. A correlation analysis is then performed between the first group and the second group of irrelevant feature data; for example, if feature A in the first group and feature B in the second group are correlated with each other, A and B are removed, and the mutually uncorrelated irrelevant feature data is kept as the final irrelevant feature data. Because the first and second groups of irrelevant feature data are already the outliers within their respective classes of data and may therefore be fraud data, a correlation analysis is performed among these possibly fraudulent items: items that are correlated have a higher probability of being normal data, while items that remain uncorrelated have a higher probability of being fraud data. Building the point set S from the final irrelevant feature data for fraud identification can improve the accuracy of fraud identification.
Referring to FIG. 9, in another embodiment, the uncorrelated analysis unit 30 includes:
a visualization module 303, configured to visualize the plurality of feature data;
a matrix analysis module 304, configured to extract the feature data corresponding to the discrete points in the visualization, perform correlation matrix analysis on the feature data corresponding to the discrete points, extract the unassociated feature data that has no association among the feature data corresponding to the discrete points, and record the unassociated feature data as the irrelevant feature data.
In the visualization module 303 and the matrix analysis module 304, the feature data of each class of data is first visualized, the discrete points among the various kinds of feature data are selected, and the feature data corresponding to each discrete point is found; correlation matrix analysis is then used to determine whether the feature data corresponding to the discrete points is associated, and the unassociated feature data is recorded as the irrelevant feature data. In other words, the possibly irrelevant feature data is first found through the visualization process and is then processed once more by correlation matrix analysis to obtain the final irrelevant feature data, thereby improving the accuracy of the subsequent identification of fraud data.
Referring to FIG. 10, in this embodiment, the data fraud identification apparatus further includes:
a fraud level determining unit 50, configured to determine the fraud level of the fraud data according to preset rules;
a penalty unit 60, configured to take corresponding penalty measures according to the corresponding fraud level.
In the fraud level determining unit 50 and the penalty unit 60, the Voronoi algorithm outputs the first n points with the smallest anomaly factors, and the data corresponding to the foremost point has the highest probability of being fraud data, so the fraud level of the fraud data is determined according to the order of the output. The penalty measures generally include raising an alarm, imposing a fine, suspending the account, and the like. For example, the relevant data of an enterprise is extracted and then analyzed with the method described above; if no fraud data exists, the enterprise is considered reputable, while if fraud data exists, the number of items of fraud data output is assessed: the more fraud data output, the lower the enterprise's credibility. The data output by the Voronoi algorithm can also be traced back to the corresponding original data so as to analyze the enterprise's fraudulent behavior, for example the fraud amount and the type of fraud, and whether to raise an alarm, suspend the account, and so on is then decided according to the fraud amount and/or the fraudulent behavior.
In one specific implementation, A needs to visit enterprise B for a business inspection and to sign a cooperation contract. Before going to enterprise B, A first uses the fraud data identification method described above to obtain enterprise B's fraud data on the blockchain within a specified time period. If there is no fraud data, A can choose a closer form of cooperation; if fraud data exists but is scarce, for example a single item of fraud data within five years, A can choose an ordinary degree of cooperation; and if there is a large amount of fraud data, A needs to consider whether to establish a cooperative relationship with enterprise B at all.
The data fraud identification apparatus of the embodiments of the present application is the first to address the identification of fraudulent data on an enterprise blockchain. By using the Voronoi algorithm, it can single out the data that may be fraud data, so that an enterprise can learn whether the other people or enterprises it does business with may be behaving fraudulently, and can then choose an appropriate closeness of cooperation, reducing the risk of being defrauded in cooperation between enterprises or between individuals and enterprises.
Referring to FIG. 11, an embodiment of the present invention further provides a computer device, which may be a server whose internal structure may be as shown in FIG. 11. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device is used to store data such as the Voronoi algorithm model. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by the processor, the flows of the above method embodiments are implemented.
An embodiment of the present invention further provides a computer non-volatile readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the flows of the above method embodiments are implemented.
The above descriptions are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A data fraud identification method for data fraud identification on a blockchain, the method comprising:
    acquiring data related to a specified enterprise on a blockchain;
    performing feature extraction on the acquired data to obtain a plurality of feature data;
    extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data;
    performing outlier identification on the irrelevant feature data by means of a Voronoi algorithm to obtain fraud data.
  2. The data fraud identification method according to claim 1, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    visualizing the plurality of feature data, and recording feature data corresponding to discrete points in the visualization as the irrelevant feature data.
  3. The data fraud identification method according to claim 2, wherein the step of visualizing the plurality of feature data comprises:
    plotting the plurality of feature data as a scatter plot.
  4. The data fraud identification method according to claim 1, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    performing correlation matrix analysis on the plurality of feature data, and extracting the irrelevant feature data that is not correlated with other feature data.
  5. The data fraud identification method according to claim 1, wherein the step of performing feature extraction on the acquired data to obtain feature data comprises:
    classifying the acquired data according to preset requirements;
    performing feature extraction on each class of data separately.
  6. The data fraud identification method according to claim 5, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    extracting irrelevant feature data from the plurality of feature data corresponding to each class of data;
    mixing the irrelevant feature data corresponding to the classes of data, performing a correlation analysis, and recording the irrelevant feature data having no correlation as final irrelevant feature data.
  7. The data fraud identification method according to claim 1, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    visualizing the plurality of feature data;
    extracting feature data corresponding to discrete points in the visualization, performing correlation matrix analysis on the feature data corresponding to the discrete points, extracting unassociated feature data having no association among the feature data corresponding to the discrete points, and recording the unassociated feature data as the irrelevant feature data.
  8. A data fraud identification apparatus for data fraud identification on a blockchain, the apparatus comprising:
    an obtaining unit, configured to acquire data related to a specified enterprise on a blockchain;
    a feature extraction unit, configured to perform feature extraction on the acquired data to obtain a plurality of feature data;
    an uncorrelated analysis unit, configured to extract, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data;
    an abnormality identification unit, configured to perform outlier identification on the irrelevant feature data by means of a Voronoi algorithm to obtain fraud data.
  9. The data fraud identification apparatus according to claim 8, wherein the uncorrelated analysis unit comprises:
    a visual analysis module, configured to visualize the plurality of feature data and record feature data corresponding to discrete points in the visualization as the irrelevant feature data.
  10. The data fraud identification apparatus according to claim 9, wherein the visual analysis module comprises:
    a scatter plot creation sub-module 311, configured to plot the plurality of feature data as a scatter plot.
  11. The data fraud identification apparatus according to claim 8, wherein the uncorrelated analysis unit comprises:
    a correlation matrix analysis module, configured to perform correlation matrix analysis on the plurality of feature data and extract the irrelevant feature data that is not correlated with other feature data.
  12. The data fraud identification apparatus according to claim 8, wherein the feature extraction unit comprises:
    a classification module, configured to classify the acquired data according to preset requirements;
    an extraction module, configured to perform feature extraction on each class of data separately.
  13. The data fraud identification apparatus according to claim 12, wherein the uncorrelated analysis unit comprises:
    a classification analysis module, configured to extract irrelevant feature data from the plurality of feature data corresponding to each class of data;
    a mixed analysis module, configured to mix the irrelevant feature data corresponding to the classes of data, perform a correlation analysis, and record the irrelevant feature data having no correlation as final irrelevant feature data.
  14. The data fraud identification apparatus according to claim 8, wherein the uncorrelated analysis unit comprises:
    a visualization module, configured to visualize the plurality of feature data;
    a matrix analysis module, configured to extract feature data corresponding to discrete points in the visualization, perform correlation matrix analysis on the feature data corresponding to the discrete points, extract unassociated feature data having no association among the feature data corresponding to the discrete points, and record the unassociated feature data as the irrelevant feature data.
  15. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现数据欺诈识别方法,用于区块链上的数据欺诈识别,所述方法,包括:A computer device comprising a memory and a processor, the memory storing computer readable instructions, wherein the processor implements a data fraud identification method when the computer readable instructions are executed, for use on a blockchain Data fraud identification, the method comprising:
    在区块链上获取与指定企业相关的数据;Obtaining data related to the designated enterprise on the blockchain;
    将获取的数据进行特征提取,以得到多个特征数据;Extracting the acquired data to obtain a plurality of feature data;
    在所述多个特征数据中提取出与其它特征数据不相关的特征数据作为不相关特征数据;Extracting feature data not related to other feature data as unrelated feature data in the plurality of feature data;
    通过Voronoi算法对所述不相关特征数据进行异常值识别,得出欺诈数据。The outlier data is identified by the Voronoi algorithm to obtain fraud data.
  16. 根据权利要求15所述的计算机设备,其特征在于,所述在所述多个特征数据中提取出与其它特征数据不相关的特征数据作为不相关特征数据的步骤,包括:The computer device according to claim 15, wherein the step of extracting feature data not related to other feature data as irrelevant feature data in the plurality of feature data comprises:
    将所述多个特征数据可视化处理,将可视化中的离散点对应的特征数据记为所述不相关特征数据。The plurality of feature data are visualized, and the feature data corresponding to the discrete points in the visualization is recorded as the irrelevant feature data.
  17. 根据权利要求16所述的计算机设备,其特征在于,所述将所述多个特征数据可视化处理的步骤,包括:The computer device according to claim 16, wherein the step of visualizing the plurality of feature data comprises:
    将所述多个特征数据制作成散点图。The plurality of feature data is made into a scatter plot.
  18. The computer device according to claim 15, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with the other feature data as irrelevant feature data comprises:
    performing correlation matrix analysis on the plurality of feature data to extract the irrelevant feature data that is not correlated with the other feature data.
  19. A computer non-volatile readable storage medium having computer readable instructions stored thereon, wherein the computer readable instructions, when executed by a processor, implement a data fraud identification method for data fraud identification on a blockchain, the method comprising:
    acquiring data related to a specified enterprise on the blockchain;
    performing feature extraction on the acquired data to obtain a plurality of feature data;
    extracting, from the plurality of feature data, feature data that is not correlated with the other feature data as irrelevant feature data;
    performing outlier identification on the irrelevant feature data by means of the Voronoi algorithm to obtain fraud data.
  20. The computer non-volatile readable storage medium according to claim 19, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with the other feature data as irrelevant feature data comprises:
    visualizing the plurality of feature data, and recording the feature data corresponding to discrete points in the visualization as the irrelevant feature data.
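The claims above describe several computational steps without fixing any concrete implementation. The sketches that follow illustrate, in Python, one possible reading of each step; they are illustrative only, and every library, column name, and threshold they use is an assumption rather than part of the disclosure.

Claim 12 classifies the acquired data according to a preset requirement and then extracts features for each category separately. A minimal sketch, assuming hypothetical record columns tx_type, account, and amount:

```python
import pandas as pd

def classify_and_extract(records: pd.DataFrame, category_col: str = "tx_type") -> dict:
    """Classify acquired records by a preset category column, then extract
    simple per-account features within each category.

    The category column and the chosen features (transaction count, amount
    mean/std) are illustrative assumptions, not taken from the application.
    """
    features_by_category = {}
    for category, group in records.groupby(category_col):
        features = group.groupby("account").agg(
            tx_count=("amount", "size"),
            amount_mean=("amount", "mean"),
            amount_std=("amount", "std"),
        )
        features_by_category[category] = features.fillna(0.0)
    return features_by_category
```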
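Claims 14, 16 and 17 visualize the plurality of feature data as a scatter plot and treat the discrete (isolated) points as the irrelevant feature data. A hedged sketch of one way to do this, using a simple z-score rule to decide which plotted points count as discrete (the threshold and the use of pandas/matplotlib are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

def discrete_points(features: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Plot the first two numeric feature columns and flag points that sit
    far from the bulk of the data as 'discrete'."""
    z = (features - features.mean()) / features.std(ddof=0)
    is_discrete = (z.abs() > z_threshold).any(axis=1)

    x_col, y_col = features.columns[:2]          # assumes >= 2 numeric columns
    plt.scatter(features.loc[~is_discrete, x_col], features.loc[~is_discrete, y_col],
                c="steelblue", label="regular")
    plt.scatter(features.loc[is_discrete, x_col], features.loc[is_discrete, y_col],
                c="crimson", label="discrete")
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.legend()
    plt.show()

    return features[is_discrete]                 # feature data of the discrete points
```

In the device of claim 14 the matrix analysis module would then run a correlation matrix over the rows returned here.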
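Claims 13 and 18 keep only the feature data that shows no correlation with the other feature data. One possible implementation uses a Pearson correlation matrix and an assumed cut-off of 0.3 (the threshold is an assumption, not stated by the applicant):

```python
import pandas as pd

def irrelevant_features(features: pd.DataFrame, max_corr: float = 0.3) -> list:
    """Return the columns whose absolute correlation with every other column
    stays below max_corr, i.e. candidate 'irrelevant feature data'."""
    corr = features.corr().abs()                 # Pearson correlation matrix
    uncorrelated = []
    for col in corr.columns:
        others = corr[col].drop(labels=[col])    # ignore self-correlation (always 1.0)
        if (others < max_corr).all():
            uncorrelated.append(col)
    return uncorrelated
```

For the hybrid analysis of claim 13 the same routine could be applied twice: once within each data category, and once more on the mixture of the per-category results.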
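Claims 15 and 19 perform outlier identification on the irrelevant feature data by the Voronoi algorithm. The claims do not spell out the variant, so the sketch below follows the common Voronoi-neighbour approach described in the cited Qu Jilin et al. paper: each point is scored by its mean distance to the points with which it shares a Voronoi ridge, and high-scoring points are reported as candidate fraud data. The scoring rule and the mean + k·std cut-off are assumptions.

```python
import numpy as np
from scipy.spatial import Voronoi

def voronoi_outliers(points: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Return indices of candidate outliers in `points` (shape n x d, d >= 2,
    with enough points for a Voronoi diagram to exist)."""
    vor = Voronoi(points)
    neighbours = {i: set() for i in range(len(points))}
    for p, q in vor.ridge_points:                # each ridge separates two input points
        neighbours[p].add(q)
        neighbours[q].add(p)

    scores = np.zeros(len(points))
    for i, nbrs in neighbours.items():
        if nbrs:
            dists = np.linalg.norm(points[list(nbrs)] - points[i], axis=1)
            scores[i] = dists.mean()             # mean distance to Voronoi neighbours

    cutoff = scores.mean() + k * scores.std()
    return np.where(scores > cutoff)[0]
```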
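Put together, and reusing the helpers sketched above, the claimed method could be exercised roughly as follows; the random DataFrame merely stands in for feature data extracted from blockchain records of the specified enterprise, since no concrete schema is fixed by the application:

```python
import numpy as np
import pandas as pd

# Placeholder for feature data obtained from the blockchain.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(200, 6)),
                  columns=[f"f{i}" for i in range(6)])

irrelevant_cols = irrelevant_features(df)              # correlation-matrix step
if len(irrelevant_cols) >= 2:                          # Voronoi needs >= 2-D points
    candidates = voronoi_outliers(df[irrelevant_cols].to_numpy())
    print("candidate fraud records:", candidates)
```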
PCT/CN2018/095389 2018-04-17 2018-07-12 Data fraud identification method, apparatus, computer device, and storage medium WO2019200739A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810344738.8A CN108665270A (en) 2018-04-17 2018-04-17 Data fraud identification method, apparatus, computer device and storage medium
CN201810344738.8 2018-04-17

Publications (1)

Publication Number Publication Date
WO2019200739A1 true WO2019200739A1 (en) 2019-10-24

Family

ID=63783647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095389 WO2019200739A1 (en) 2018-04-17 2018-07-12 Data fraud identification method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108665270A (en)
WO (1) WO2019200739A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109636577A (en) * 2018-10-25 2019-04-16 深圳壹账通智能科技有限公司 IP address analysis method, device, equipment and computer readable storage medium
CN109697670B (en) * 2018-12-29 2021-06-04 杭州趣链科技有限公司 Public link information shielding method without influence on credibility
CN111598580A (en) * 2020-04-26 2020-08-28 杭州云象网络技术有限公司 XGboost algorithm-based block chain product detection method, system and device
CN111667267B (en) * 2020-05-29 2023-04-18 中国工商银行股份有限公司 Block chain transaction risk identification method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080175226A1 (en) * 2007-01-24 2008-07-24 Secure Computing Corporation Reputation Based Connection Throttling
CN104794192A (en) * 2015-04-17 2015-07-22 南京大学 Multi-level anomaly detection method based on exponential smoothing and integrated learning model
CN105976242A (en) * 2016-04-21 2016-09-28 中国农业银行股份有限公司 Transaction fraud detection method and system based on real-time streaming data analysis
CN107194803A (en) * 2017-05-19 2017-09-22 南京工业大学 A kind of P2P nets borrow the device of borrower's assessing credit risks
CN107785058A (en) * 2017-07-24 2018-03-09 平安科技(深圳)有限公司 Anti- fraud recognition methods, storage medium and the server for carrying safety brain

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QU JILIN, ET AL.: "Outlier Detection Algorithm Based on Voronoi Diagram", COMPUTER ENGINEERING, vol. 33, no. 23, 31 December 2007 (2007-12-31) *
WANG YAN: "Application of Blockchain Technology in Financial Industry and Suggestions for Its Development", HAINAN FINANCE, vol. 12, 31 December 2016 (2016-12-31), ISSN: 1003-9031 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111586052A (en) * 2020-05-09 2020-08-25 江苏大学 Multi-level-based crowd sourcing contract abnormal transaction identification method and identification system

Also Published As

Publication number Publication date
CN108665270A (en) 2018-10-16

Similar Documents

Publication Publication Date Title
Wang et al. Linkage based face clustering via graph convolution network
WO2019200739A1 (en) Data fraud identification method, apparatus, computer device, and storage medium
Bologa et al. Big data and specific analysis methods for insurance fraud detection.
Li et al. A supervised clustering and classification algorithm for mining data with mixed variables
CN113011973B (en) Method and equipment for financial transaction supervision model based on intelligent contract data lake
CN111612041A (en) Abnormal user identification method and device, storage medium and electronic equipment
CN109829721B (en) Online transaction multi-subject behavior modeling method based on heterogeneous network characterization learning
Li et al. Outlier detection using structural scores in a high-dimensional space
JP2020524346A (en) Method, apparatus, computer device, program and storage medium for predicting short-term profits
US7725407B2 (en) Method of measuring a large population of web pages for compliance to content standards that require human judgement to evaluate
CN113364802A (en) Method and device for studying and judging security alarm threat
CN116366313A (en) Small sample abnormal flow detection method and system
Borg et al. Clustering residential burglaries using modus operandi and spatiotemporal information
CN116467666A (en) Graph anomaly detection method and system based on integrated learning and active learning
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
Gautam et al. Adaptive discretization using golden section to aid outlier detection for software development effort estimation
CN111612531B (en) Click fraud detection method and system
Sukthanker et al. On the importance of architectures and hyperparameters for fairness in face recognition
CN106778252A (en) Intrusion detection method based on rough set theory Yu WAODE algorithms
Nawaiseh et al. Financial Statement Audit using Support Vector Machines, Artificial Neural Networks and K-Nearest Neighbor: An Empirical Study of UK and Ireland
Feng et al. EagleMine: Vision-guided Micro-clusters recognition and collective anomaly detection
Guo et al. Detecting spammers in E-commerce website via spectrum features of user relation graph
Wu et al. Medical insurance fraud recognition based on improved outlier detection algorithm
Knyazeva et al. A graph-based data mining approach to preventing financial fraud: a case study
Madyembwa et al. An Automated Data Pre-processing Technique for Machine Learning in Critical Systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18915300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18915300

Country of ref document: EP

Kind code of ref document: A1