WO2019200739A1 - Data fraud identification method, apparatus, computer device, and storage medium - Google Patents

Data fraud identification method, apparatus, computer device, and storage medium

Info

Publication number
WO2019200739A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature data
feature
irrelevant
fraud
Prior art date
Application number
PCT/CN2018/095389
Other languages
French (fr)
Chinese (zh)
Inventor
王义文
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019200739A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 20/00: Payment architectures, schemes or protocols
    • G06Q 20/38: Payment protocols; details thereof
    • G06Q 20/382: Payment protocols; details thereof insuring higher security of transaction
    • G06Q 20/3829: Payment protocols; details thereof insuring higher security of transaction involving key management

Definitions

  • the present application relates to the field of data fraud identification, and in particular to a data fraud identification method, apparatus, computer device and storage medium.
  • Blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and is not subject to control by any single party. Data on a blockchain cannot be falsified after the fact, but fraudulent data created by malicious fake orders ("order brushing") can still exist. How to determine whether the data on a blockchain was generated by normal transactions or was "brushed" is an urgent problem to be solved.
  • the main purpose of the present application is to provide a data fraud identification method, apparatus, computer device and storage medium that can effectively identify fraudulent data on a blockchain.
  • The present application provides a data fraud identification method for data fraud identification on a blockchain, the method comprising: acquiring data related to a specified enterprise on the blockchain; performing feature extraction on the acquired data to obtain a plurality of feature data; extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data; and
  • performing outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
  • the application also provides a data fraud identification device for data fraud identification on a blockchain, the device comprising:
  • An obtaining unit configured to acquire data related to a specified enterprise on a blockchain
  • a feature extraction unit configured to perform feature extraction on the acquired data to obtain a plurality of feature data
  • An uncorrelated analysis unit configured to extract, as the unrelated feature data, feature data that is not related to other feature data in the plurality of feature data;
  • the abnormality identifying unit is configured to perform an outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
  • the application further provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, the processor executing the computer readable instructions to implement the steps of any of the methods described above.
  • The present application also provides a non-transitory computer-readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the steps of any of the methods described above.
  • The data fraud identification method, apparatus, computer device and storage medium of the present application address for the first time the problem of identifying fraudulent data on an enterprise blockchain. Using the Voronoi algorithm, data that may be fraudulent can be singled out, so that an enterprise can assess whether the people or companies it does business with may be engaging in fraud and then choose an appropriate degree of cooperation.
  • FIG. 1 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application.
  • FIG. 2 is a Voronoi diagram according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic block diagram showing the structure of an abnormality identifying unit according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to another embodiment of the present application.
  • FIG. 8 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application.
  • FIG. 10 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application.
  • FIG. 11 is a schematic block diagram showing the structure of a computer device according to an embodiment of the present application.
  • an embodiment of the present application provides a data fraud identification method for data fraud identification on a blockchain, where the method includes the steps of: S1: acquiring data related to a designated enterprise on a blockchain;
  • the above blockchain is a decentralized, trust-free new data architecture, which is jointly owned, managed, and supervised by all nodes in the network, and does not accept single-party control.
  • a blockchain is a distributed ledger database that manages a growing number of transaction records that are organized into blocks and protected against tampering.
  • A distributed ledger refers to a digital record of ownership that differs from traditional database technology in that no central administrator or central data store is required; the ledger can be replicated across the nodes of a peer-to-peer network, and each transaction is signed with a private key.
  • a consensus mechanism is a mechanism for the application of blockchain or distributed ledger technology that does not rely on a central authority to identify and verify a value or transaction.
  • the consensus mechanism is the basis for all blockchain and distributed ledger applications.
  • the above designated enterprise refers to the enterprise to be queried, that is, the enterprise to be queried whether there is fraudulent data on the blockchain.
  • the above related data is generally all data related to the designated enterprise in the blockchain, such as account, amount, date, time, currency, channel, merchant, product information, user IP, device, etc. related to the designated enterprise.
  • the specific method for obtaining the specified data includes inputting a keyword such as a company name and a business scope, and then performing a search on the blockchain to obtain all data related to the search term.
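  • As a rough illustration of this retrieval step (not part of the patent text, and no real blockchain client API is implied), records already exported from the chain could be filtered by keyword as follows:

```python
# Illustrative sketch only: `ledger` is assumed to be an iterable of dict records
# already read off the blockchain; the field names are hypothetical.
def search_records(ledger, keywords):
    keywords = [k.lower() for k in keywords]
    return [rec for rec in ledger
            if any(k in str(v).lower() for v in rec.values() for k in keywords)]

hits = search_records(
    ledger=[{"account": "Y-001", "merchant": "Enterprise Y", "amount": 120}],
    keywords=["Enterprise Y"])
```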
  • As described in step S2 above, this is the feature extraction process. In one embodiment, the specific process includes: integrating the data and normalizing it into a data set; formatting the data in the data set and cleaning the sampled data; and then transforming the sampled data to obtain the required feature data.
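  • A minimal sketch of that preprocessing chain (my own illustration; the column names and the use of pandas are assumptions, not the patent's code):

```python
# Integrate several exported record sets, format and clean them, then transform
# the result into the numeric feature data used by the later steps.
import pandas as pd

def preprocess(raw_frames: list) -> pd.DataFrame:
    data = pd.concat(raw_frames, ignore_index=True)               # integrate into one data set
    data["date"] = pd.to_datetime(data["date"], errors="coerce")  # format
    data = data.dropna(subset=["account", "amount"])               # clean sampled data
    data["amount_z"] = (data["amount"] - data["amount"].mean()) / data["amount"].std()
    return data                                                    # transformed feature data
```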
  • the feature extraction can use the ReliefF algorithm.
  • The ReliefF algorithm was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (the Relief algorithm is a feature weighting algorithm that assigns different weights to features according to the correlation between each feature and the class; features whose weights fall below a certain threshold are removed). Compared with the Relief algorithm, ReliefF can handle multi-class problems.
  • the ReliefF algorithm is used to process regression problems where the target attribute is a continuous value.
  • When handling multi-class problems, the ReliefF algorithm randomly draws a sample R from the training set, finds R's k nearest-neighbor samples of the same class (near hits H_j) and, for each class different from R's, k nearest-neighbor samples (near misses M_j(C)), and then updates the weight of each feature A as follows:

    W(A) = W(A) - \sum_{j=1}^{k} diff(A, R, H_j)/(m \cdot k) + \sum_{C \neq class(R)} [ p(C)/(1 - p(class(R))) ] \cdot \sum_{j=1}^{k} diff(A, R, M_j(C))/(m \cdot k)

  • diff(A, R_1, R_2) represents the difference between sample R_1 and sample R_2 on feature A; for a numeric feature it is computed as diff(A, R_1, R_2) = |R_1[A] - R_2[A]| / (max(A) - min(A)).
  • M_j(C) represents the j-th nearest-neighbor sample in class C, and m is the number of sampling iterations.
  • feature extraction is performed using the Relief algorithm described above.
  • The Relief algorithm randomly selects a sample R from the training set D, then searches for the nearest-neighbor sample H among the samples of the same class as R (called the near hit) and the nearest-neighbor sample M among the samples of a different class (called the near miss), and then updates the weight of each feature according to the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature is helpful for distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is increased; conversely, if the distance between R and the near hit is greater than the distance between R and the near miss, the feature has a negative effect on distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is decreased.
  • the above process is repeated m times, and finally the average weight of each feature is obtained.
  • The greater the weight of a feature, the stronger its ability to discriminate between classes; conversely, a smaller weight indicates a weaker discriminative ability.
  • the running time of the Relief algorithm increases linearly with the sampling number m of samples and the number of original features N, so the operating efficiency is very high.
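  • A rough sketch (my own illustration, not the patent's code) of this Relief weight update for binary-class data with numeric features:

```python
# Relief weight update: reward features on which R is far from its nearest miss
# and close to its nearest hit, repeated over m random samples.
import numpy as np

def relief_weights(X: np.ndarray, y: np.ndarray, m: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12     # normalizes per-feature differences
    w = np.zeros(X.shape[1])
    for _ in range(m):
        i = rng.integers(len(X))
        r, label = X[i], y[i]
        same, other = X[y == label], X[y != label]
        hits = same[np.argsort(np.linalg.norm(same - r, axis=1))]
        near_hit = hits[1] if len(hits) > 1 else hits[0]   # skip R itself
        near_miss = other[np.argmin(np.linalg.norm(other - r, axis=1))]
        w += (np.abs(r - near_miss) - np.abs(r - near_hit)) / (span * m)
    return w    # larger weight = stronger ability to separate the classes
```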
  • As described in step S3 above, the goal is to find, among the feature data, the irrelevant feature data that is not correlated with the other feature data. Because irrelevant feature data is not correlated with the other feature data, its corresponding raw data may be fraudulent data.
  • The Voronoi diagram described above is also called a Thiessen polygon or Dirichlet tessellation; it consists of a set of contiguous polygons formed by the perpendicular bisectors of the straight lines connecting pairs of neighboring points. Given N distinct points in the plane, the plane is partitioned according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region.
  • the outliers can be further identified from the above irrelevant feature data by the Voronoi algorithm, and the identified outliers are considered fraud data.
  • the above Voronoi algorithm has lower complexity and faster calculation speed.
  • In a specific embodiment, all data related to enterprise Y is extracted from the blockchain, such as the account of enterprise Y, the account login times, the number of transactions, the transaction amounts, the channel, the merchant, the product information, the user IP, and the like. Feature extraction is then performed on the obtained data, and correlation analysis is performed on the extracted feature data; the feature data that is not correlated with other feature data is the irrelevant feature data, and the raw data corresponding to the irrelevant feature data may be fraudulent.
  • To determine how likely it is that the raw data corresponding to the irrelevant feature data is fraudulent, outlier identification is performed on the irrelevant feature data by the Voronoi algorithm. The Voronoi algorithm finds the outliers among the irrelevant feature data (according to the rules of the Voronoi algorithm, it computes which irrelevant feature data differ from the other irrelevant feature data) and then outputs the outliers in order, where the raw data corresponding to the first outlier output is the most likely to be fraudulent, with the likelihood decreasing in turn. In other embodiments, the order may instead be configured so that the raw data corresponding to the first outlier output is the least likely to be fraudulent, with the likelihood increasing in turn.
  • All of the outliers may be output, or only a specified number of outliers whose corresponding raw data is most likely to be fraudulent may be output.
  • the step S4 of obtaining the fraudulent data by using the Voronoi algorithm to perform the outlier identification on the irrelevant feature data includes:
  • the above irrelevant feature data is made into a Voronoi diagram of the point set S;
  • each irrelevant feature data is regarded as a point, thereby generating a point set S.
  • b. Calculate the V-anomaly factor of each point in the point set S and find the V-adjacent points of each point. Specifically: b1. For a point p_i in the point set S, determine the adjacent points of its Voronoi polygon V(p_i), calculate the average distance from p_i to its adjacent points, and use the reciprocal of this average distance to measure the degree of abnormality of p_i.
  • For any point p of the point set S, a neighboring point of p determined by an edge of V(p) is called a V-adjacent point of p, and V(p) also denotes the set of all V-adjacent points of point p.
  • The reciprocal of the average distance from point p to all of its V-adjacent points is called the V-anomaly factor of point p, denoted Vd(p):

    Vd(p) = |V(p)| / \sum_{q \in V(p)} d(p, q)

  • where |V(p)| is the number of V-adjacent points of p and d(p, q) is the distance between p and q. Vd(p) reflects the distribution density of points around point p.
  • c. Arrange the V-anomaly factors of the points from small to large.
  • d. Output the V-anomaly factor of each point and the first n points with the smallest anomaly factors; the data corresponding to these n points is judged to be at the highest risk of being fraudulent (see the sketch below), where n is a preset value and is an integer greater than zero.
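  • A minimal sketch of this V-anomaly computation, assuming the irrelevant feature data has been reduced to two-dimensional points and using scipy.spatial.Voronoi (an illustrative choice, not the patent's reference implementation):

```python
# For each point p, Vd(p) is the reciprocal of the mean distance from p to its
# Voronoi-adjacent points; a smaller Vd(p) means a sparser neighborhood, i.e. a
# stronger outlier candidate.
import numpy as np
from scipy.spatial import Voronoi

def v_anomaly_factors(points: np.ndarray) -> np.ndarray:
    vor = Voronoi(points)
    neighbors = {i: set() for i in range(len(points))}
    for a, b in vor.ridge_points:            # each ridge joins two adjacent cells
        neighbors[a].add(b)
        neighbors[b].add(a)
    vd = np.empty(len(points))
    for i, adj in neighbors.items():
        dists = np.linalg.norm(points[list(adj)] - points[i], axis=1)
        vd[i] = len(dists) / dists.sum()      # reciprocal of the average distance
    return vd

# Usage: flag the n points with the smallest V-anomaly factors as candidate fraud.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (50, 2)), [[8.0, 8.0]]])   # one obvious outlier
print(np.argsort(v_anomaly_factors(pts))[:3])
```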
  • the step S3 of extracting the feature data that is not related to the other feature data as the unrelated feature data in the plurality of feature data includes:
  • S31 Visualize the plurality of feature data, and record the feature data corresponding to the discrete points in the visualization as the irrelevant feature data.
  • The visualization processing described above refers to converting the feature data into a graphic or an image on a screen by using computer graphics and image processing techniques. Because the feature data is visualized, a person can visually identify the discrete points on the graphic or image and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
  • the step of visualizing the plurality of feature data includes: forming the plurality of feature data into a scattergram.
  • the above scatter diagram refers to the distribution of data points on the Cartesian coordinate plane in regression analysis; it is usually used to compare aggregated data across categories. The more data you have in a scatter plot, the better the comparison will be.
  • the feature data is generally a matrix.
  • a scatter plot matrix can be used to simultaneously draw a scatter plot between the variables, so that the main correlation between multiple variables can be quickly found.
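  • A minimal sketch of such a scatter-plot matrix, assuming the feature data has been loaded into a pandas DataFrame (the column names below are illustrative only):

```python
# Plot every pair of feature columns against each other; points that sit far
# from every cluster are candidate irrelevant feature data.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

features = pd.DataFrame({
    "login_hour": [9, 10, 9, 11, 3],
    "tx_count":   [5, 6, 4, 7, 90],
    "tx_amount":  [120, 150, 110, 160, 9999],
})
scatter_matrix(features, figsize=(6, 6), diagonal="hist")
plt.show()
```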
  • In another embodiment, the step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as the irrelevant feature data includes: S32. Performing correlation matrix analysis on the plurality of feature data, and extracting the feature data that is not correlated with other feature data as the irrelevant feature data.
  • the above correlation matrix is also called a correlation coefficient matrix, which is composed of correlation coefficients between columns of the matrix. That is to say, the elements of the i-th row and the j-th column of the correlation matrix are the correlation coefficients of the i-th column and the j-th column of the original matrix.
  • A covariance matrix is generally used for the analysis. The covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating a positive correlation; if they change in opposite directions, the covariance is negative, indicating a negative correlation; if the two variables are independent of each other, the covariance is 0, indicating that the two variables are unrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
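  • A hedged sketch of this screening step (the 0.2 threshold and the use of pandas are my assumptions, not values given in the application): a feature whose strongest absolute correlation with every other feature stays below the threshold is treated as irrelevant.

```python
import numpy as np
import pandas as pd

def irrelevant_features(features: pd.DataFrame, threshold: float = 0.2) -> list:
    corr = features.corr().abs()            # correlation coefficient matrix
    np.fill_diagonal(corr.values, 0.0)      # ignore each feature's self-correlation
    return [col for col in corr.columns if corr[col].max() < threshold]
```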
  • In an embodiment, the step S2 of performing feature extraction on the acquired data to obtain feature data includes: S201. Classifying the acquired data according to a preset requirement; and S202. Performing feature extraction on each type of data separately.
  • The preset requirement is a classification standard; generally, data that may be related are classified into one category. For example, data such as the account, login time, number of transactions, and transaction amount are classified together, because the correlation among these data is strong: a user logs in to the corresponding system through the account, the system records the login time, and it also records the number of transactions and the amount of each transaction or the total transaction amount, so there are associations among these data.
  • the acquired data is first classified, and then the feature extraction is performed on different types of data respectively.
  • The specific feature extraction may be performed by the Relief algorithm described above or by the ReliefF algorithm described above.
  • Because the data is classified before the features are extracted, the features of each type of data are relatively distinct and easy to extract, which improves the accuracy of the subsequent identification of fraud data (see the sketch below).
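  • A minimal sketch of steps S201/S202 (the category names and field lists are assumptions for illustration; they are not prescribed by the application):

```python
# Group the raw records into preset categories; each category's records are then
# fed to the Relief/ReliefF feature extraction separately.
CATEGORY_RULES = {
    "account_activity":    ["account", "login_time", "tx_count", "tx_amount"],
    "transaction_context": ["channel", "merchant", "product_info", "user_ip"],
}

def classify(records: list) -> dict:
    grouped = {name: [] for name in CATEGORY_RULES}
    for rec in records:                       # rec is a dict of raw fields
        for name, fields in CATEGORY_RULES.items():
            grouped[name].append({k: rec[k] for k in fields if k in rec})
    return grouped
```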
  • In an embodiment, the step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as the irrelevant feature data includes: S301. Extracting the irrelevant feature data from the feature data corresponding to each type of data; and
  • S302. Mixing the irrelevant feature data corresponding to the various types of data, performing correlation analysis on them, and recording the irrelevant feature data that has no correlation as the final irrelevant feature data.
  • the feature extraction is performed on each type of data described above.
  • For example, the features extracted from the account, login time, transaction-count and transaction-amount category are first visualized, or analyzed with a correlation matrix, to obtain a first set of irrelevant feature data; similarly, the features extracted from the channel, merchant, product-information and user-IP category are visualized, or analyzed with a correlation matrix, to obtain a second set of irrelevant feature data. The first set and the second set of irrelevant feature data are then correlated with each other: for example, if feature A in the first set and feature B in the second set are associated with each other, A and B are eliminated, and the irrelevant feature data that remains unassociated is retained as the final irrelevant feature data.
  • The first set and the second set of irrelevant feature data are already the outliers within their respective types of data and may be fraudulent data; correlation analysis is then performed on these candidates, and where a correlation exists the data is more likely to be normal, whereas the data that remains unrelated is more likely to be fraudulent.
  • Making the final irrelevant feature data into a point set S for fraud identification can improve the accuracy of fraud identification.
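  • A sketch of this cross-group screening, under the assumption that both groups of candidate features are columns over the same records (the 0.5 threshold is illustrative):

```python
# Drop any candidate feature from group A or group B that correlates with a
# candidate in the other group; what survives is the final irrelevant feature set.
import pandas as pd

def final_irrelevant(group_a: pd.DataFrame, group_b: pd.DataFrame, thr: float = 0.5):
    keep_a, keep_b = set(group_a.columns), set(group_b.columns)
    for a in group_a.columns:
        for b in group_b.columns:
            if abs(group_a[a].corr(group_b[b])) >= thr:
                keep_a.discard(a)
                keep_b.discard(b)
    return sorted(keep_a), sorted(keep_b)
```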
  • In a further embodiment, the step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as the irrelevant feature data includes: first visualizing the feature data of each type of data, selecting the discrete points in each type of feature data, and looking up the feature data corresponding to each discrete point; and then using correlation matrix analysis to determine whether the feature data corresponding to the discrete points are associated, and recording the unassociated feature data as the irrelevant feature data. That is, the possibly irrelevant feature data is first found through the visualization processing and is then processed again by the matrix analysis method to obtain the final irrelevant feature data, so as to improve the accuracy of the subsequent identification of fraud data.
  • In an embodiment, after the step S4, the method further includes: S5. Determining a fraud level of the fraud data according to a preset rule; and S6. Taking corresponding punitive measures according to the fraud level.
  • The Voronoi algorithm outputs the first n points with the smallest anomaly factors; the earlier a point is output, the higher the probability that its corresponding data is fraudulent, so the fraud level of the fraud data is determined according to the output order.
  • the above punitive measures generally include alarms, fines, and banned accounts.
  • The relevant data of an enterprise is extracted and then analyzed by the above method. If no fraud data is found, the enterprise is considered to be a reputable enterprise; if fraud data is present, the output fraud data is evaluated, and the more fraud data is output, the lower the enterprise's credibility.
  • The corresponding original data can be traced back from the fraud data, and the fraudulent behavior of the enterprise, such as the fraud amount and the fraud method, can then be analyzed. According to the fraud amount and/or fraud method, it is decided whether to raise an alarm, impose a fine, or ban the account (a sketch follows below).
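  • An illustrative sketch only: mapping the output order of the anomaly points to a fraud level and a punitive action. The thresholds and the level-to-action mapping are assumptions, not values given in the application.

```python
# Earlier output rank (smaller anomaly factor) = higher fraud risk.
def fraud_level(rank: int) -> str:
    return "high" if rank < 3 else "medium" if rank < 10 else "low"

ACTIONS = {"high": "ban account", "medium": "fine", "low": "alarm"}

def punitive_actions(outlier_ranks: dict) -> dict:
    """outlier_ranks maps an account (or record id) to its output rank."""
    return {acct: ACTIONS[fraud_level(r)] for acct, r in outlier_ranks.items()}

print(punitive_actions({"Y-001": 0, "Y-042": 7, "Y-113": 25}))
```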
  • For example, enterprise A needs to visit enterprise B for a business inspection and to sign a cooperation contract.
  • Through the above fraud-data identification method, enterprise A obtains the fraud data of enterprise B over a specified time period on the blockchain. If there is no fraud data, a closer mode of cooperation can be selected; if there is fraud data but only a little, for example a single item of fraud data within five years, a relatively close mode of cooperation can still be chosen; if there is a larger amount of fraud data, enterprise A needs to reconsider whether to establish a cooperative relationship with enterprise B at all.
  • The data fraud identification method in the embodiment of the present application solves for the first time the problem of identifying fraudulent data on an enterprise blockchain.
  • The Voronoi algorithm can single out the data that may be fraudulent, so that an enterprise can assess whether the people or enterprises it does business with may have engaged in fraud, choose an appropriate degree of cooperation accordingly, and thereby reduce the risk in enterprise-to-enterprise or individual-to-enterprise cooperation.
  • an embodiment of the present application further provides a data fraud identification apparatus for data fraud identification on a blockchain, where the apparatus includes:
  • the obtaining unit 10 is configured to acquire data related to the designated enterprise on the blockchain;
  • the feature extraction unit 20 is configured to perform feature extraction on the acquired data to obtain a plurality of feature data.
  • the uncorrelated analysis unit 30 is configured to extract, as the unrelated feature data, feature data that is not related to other feature data in the plurality of feature data;
  • the abnormality identifying unit 40 is configured to perform abnormal value identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
  • the above blockchain is a decentralized, trust-free new data architecture, which is jointly owned, managed, and supervised by all nodes in the network, and does not accept single-party control.
  • a blockchain is a distributed ledger database that manages a growing number of transaction records that are organized into blocks and protected against tampering.
  • A distributed ledger refers to a digital record of ownership that differs from traditional database technology in that no central administrator or central data store is required; the ledger can be replicated across the nodes of a peer-to-peer network, and each transaction is signed with a private key.
  • a consensus mechanism is a mechanism for the application of blockchain or distributed ledger technology that does not rely on a central authority to identify and verify a value or transaction.
  • the consensus mechanism is the basis for all blockchain and distributed ledger applications.
  • the above designated enterprise refers to the enterprise to be queried, that is, the enterprise to be queried whether there is fraudulent data on the blockchain.
  • the above related data is generally all data related to the designated enterprise in the blockchain, such as account, amount, date, time, currency, channel, merchant, product information, user IP, device, etc. related to the designated enterprise.
  • the specific method for obtaining the specified data includes inputting a keyword such as a company name and a business scope, and then performing a search on the blockchain to obtain all data related to the search term.
  • The feature extraction unit 20 is the unit that performs feature extraction. In one embodiment, the specific process includes: integrating the data and normalizing it into a data set; formatting the data in the data set and cleaning the sampled data; and then transforming the sampled data to obtain the required feature data.
  • In a specific embodiment, the feature extraction unit 20 uses the ReliefF algorithm, which was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (the Relief algorithm is a feature weighting algorithm that assigns different weights to features according to the correlation between each feature and the class; features whose weights fall below a certain threshold are removed). Compared with the Relief algorithm, ReliefF can handle multi-class problems.
  • The ReliefF algorithm is used to process regression problems where the target attribute is a continuous value. When handling multi-class problems, the ReliefF algorithm randomly draws a sample R from the training set, finds R's k nearest-neighbor samples of the same class (near hits) and, for each class different from R's, k nearest-neighbor samples (near misses), and then updates the weight of each feature.
  • the specific process is described in the foregoing method embodiment, and details are not described herein.
  • feature extraction unit 20 performs feature extraction using the Relief algorithm described above.
  • The Relief algorithm randomly selects a sample R from the training set D, then searches for the nearest-neighbor sample H among the samples of the same class as R (called the near hit) and the nearest-neighbor sample M among the samples of a different class (called the near miss).
  • The weight of each feature is updated according to the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature is helpful for distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is increased; conversely, if the distance between R and the near hit is greater than the distance between R and the near miss, the feature has a negative effect on distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is decreased.
  • The above process is repeated m times, and finally the average weight of each feature is obtained. The greater the weight of a feature, the stronger its ability to discriminate between classes; conversely, a smaller weight indicates a weaker discriminative ability.
  • the running time of the Relief algorithm increases linearly with the sampling number m of samples and the number of original features N, so the operating efficiency is very high.
  • The irrelevant analysis unit 30 is the unit that finds, among the feature data, the irrelevant feature data that is not correlated with the other feature data. Because the irrelevant feature data is not correlated with the other feature data, its corresponding original data may be fraudulent data.
  • The Voronoi diagram described above is also called a Thiessen polygon or Dirichlet tessellation; it consists of a set of contiguous polygons formed by the perpendicular bisectors of the straight lines connecting pairs of neighboring points. Given N distinct points in the plane, the plane is partitioned according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region.
  • the outliers can be further identified from the above irrelevant feature data by the Voronoi algorithm, and the identified outliers are considered fraud data.
  • the above Voronoi algorithm has lower complexity and faster calculation speed.
  • In a specific embodiment, the obtaining unit 10 extracts all data related to enterprise Y from the blockchain, such as the account of enterprise Y, the account login times, the number of transactions, the transaction amounts, the channel, the merchant, the product information, the user IP, and the like. The feature extraction unit 20 then performs feature extraction on the obtained data, and the irrelevant analysis unit 30 performs correlation analysis on the extracted feature data; the feature data that is not correlated with other feature data is the irrelevant feature data, and the original data corresponding to the irrelevant feature data may be fraudulent data.
  • The abnormality identifying unit 40 performs outlier identification on the irrelevant feature data by the Voronoi algorithm; according to the rules of the Voronoi algorithm, it computes which irrelevant feature data differ from the other irrelevant feature data.
  • The Voronoi algorithm finds the outliers among the irrelevant feature data and then outputs the outliers in order, where the raw data corresponding to the first outlier output is the most likely to be fraudulent, with the likelihood decreasing in turn.
  • In other embodiments, the order may instead be configured so that the raw data corresponding to the first outlier output is the least likely to be fraudulent, with the likelihood increasing in turn.
  • All of the outliers may be output, or only a specified number of outliers whose corresponding raw data is most likely to be fraudulent may be output.
  • the abnormality identifying unit 40 includes:
  • the graphics module 41 is configured to generate the above-mentioned irrelevant feature data into a Voronoi diagram of the point set S; wherein each irrelevant feature data is regarded as a point, thereby generating a point set S.
  • the calculation module 42 is configured to calculate a V-abnormality factor of each point in the point set S, and find a V-adjacent point of each point.
  • The specific execution process of the calculation module 42 includes: for a point p_i in the point set S, determining the adjacent points of its Voronoi polygon V(p_i), calculating the average distance from p_i to its adjacent points, and using the reciprocal of this average distance to measure the degree of abnormality of p_i. For any point p of the point set S, a neighboring point of p determined by an edge of V(p) is called a V-adjacent point of p, and V(p) also denotes the set of all V-adjacent points of point p.
  • The reciprocal of the average distance from point p to all of its V-adjacent points is called the V-anomaly factor of point p, denoted Vd(p):

    Vd(p) = |V(p)| / \sum_{q \in V(p)} d(p, q)

  • where |V(p)| is the number of V-adjacent points of p and d(p, q) is the distance between p and q. Vd(p) reflects the distribution density of points around point p.
  • Arrangement module 43 for arranging the V-anomaly factors of each point from small to large;
  • The output module 44 is configured to output the V-anomaly factor of each point and the first n points with the smallest anomaly factors; the data corresponding to these n points is judged to be at the highest risk of being fraudulent. The sparser the distribution of points around a point p, the smaller its anomaly factor and the lower its correlation with the other data, so the original data of the irrelevant feature data corresponding to the smallest V-anomaly factors is the most likely to be fraudulent.
  • n is a preset value and is an integer greater than zero.
  • the uncorrelated analysis unit 30 includes:
  • the visual analysis module 31 is configured to visualize the plurality of feature data, and record the feature data corresponding to the discrete points in the visualization as the irrelevant feature data.
  • The visualization processing described above refers to converting the feature data into a graphic or an image on a screen by using computer graphics and image processing techniques. Because the feature data is visualized, a person can visually identify the discrete points on the graphic or image and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
  • the visual analysis module 31 includes a scatter plot creation sub-module 311 for creating the plurality of feature data into a scatter plot.
  • the above scatter diagram refers to the distribution of data points on the Cartesian coordinate plane in regression analysis; it is usually used to compare aggregated data across categories. The more data you have in a scatter plot, the better the comparison will be.
  • the feature data is generally a matrix.
  • a scatter plot matrix can be used to simultaneously draw a scatter plot between the variables, so that the main correlation between multiple variables can be quickly found.
  • the foregoing irrelevant analysis unit 30 includes:
  • the correlation matrix analysis module 32 is configured to perform correlation matrix analysis on the plurality of feature data, and extract the irrelevant feature data that is not related to other feature data.
  • the above correlation matrix is also called a correlation coefficient matrix, which is composed of correlation coefficients between columns of the matrix. That is to say, the elements of the i-th row and the j-th column of the correlation matrix are the correlation coefficients of the i-th column and the j-th column of the original matrix.
  • A covariance matrix is generally used for the analysis. The covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating a positive correlation; if they change in opposite directions, the covariance is negative, indicating a negative correlation; if the two variables are independent of each other, the covariance is 0, indicating that the two variables are unrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
  • the feature extraction unit 20 includes:
  • a classification module 21 configured to classify the acquired data according to a preset requirement
  • the extracting module 22 is configured to perform feature extraction on each type of data separately.
  • The preset requirement is a classification standard; generally, data that may be related are classified into one category. For example, data such as the account, login time, number of transactions, and transaction amount are classified together, because the correlation among these data is strong: a user logs in to the corresponding system through the account, the system records the login time, and it also records the number of transactions and the amount of each transaction or the total transaction amount, so there are associations among these data.
  • the acquired data is first classified, and then the feature extraction is performed on different types of data respectively.
  • The feature extraction may be performed by the Relief algorithm described above or by the ReliefF algorithm described above.
  • Because the data is classified before the features are extracted, the features of each type of data are relatively distinct and easy to extract, which improves the accuracy of the subsequent identification of fraud data.
  • In an embodiment, the above irrelevant analysis unit 30 includes:
  • the classification analysis module 301 is configured to extract irrelevant feature data of the plurality of feature data corresponding to the various types of data;
  • the hybrid analysis module 302 is configured to mix the uncorrelated feature data corresponding to the various types of data, perform correlation analysis, and record the irrelevant feature data without correlation as the final irrelevant feature data.
  • the feature extraction is performed on each type of data described above.
  • For example, the features extracted from the account, login time, transaction-count and transaction-amount category are first visualized, or analyzed with a correlation matrix, to obtain a first set of irrelevant feature data; similarly, the features extracted from the channel, merchant, product-information and user-IP category are visualized, or analyzed with a correlation matrix, to obtain a second set of irrelevant feature data. The first set and the second set of irrelevant feature data are then correlated with each other: for example, if feature A in the first set and feature B in the second set are associated with each other, A and B are eliminated, and the irrelevant feature data that remains unassociated is retained as the final irrelevant feature data.
  • The first set and the second set of irrelevant feature data are already the outliers within their respective types of data and may be fraudulent data; correlation analysis is then performed on these candidates, and where a correlation exists the data is more likely to be normal, whereas the data that remains unrelated is more likely to be fraudulent.
  • Making the final irrelevant feature data into a point set S for fraud identification can improve the accuracy of fraud identification.
  • the above-mentioned irrelevant analysis unit 30 includes:
  • a visualization module 303 configured to perform visual processing on the plurality of feature data
  • the matrix analysis module 304 is configured to extract feature data corresponding to discrete points in the visualization, and perform correlation matrix analysis on the feature data corresponding to the discrete points, and extract non-associated feature data that is not associated in the feature data corresponding to each discrete point, and The non-associated feature data is recorded as the irrelevant feature data.
  • the feature data of each type of data is first visualized, the discrete points in the various feature data are selected, and the feature data corresponding to each discrete point is searched;
  • The matrix analysis method is used to determine whether the feature data corresponding to the discrete points are associated, and the unassociated feature data is recorded as the irrelevant feature data. That is, the possibly irrelevant feature data is first found through the visualization processing and is then processed again by the matrix analysis method to obtain the final irrelevant feature data, so as to improve the accuracy of the subsequent identification of fraud data.
  • the data fraud identification device further includes:
  • a fraud level determining unit 50 configured to determine a fraud level of the fraud data according to a preset rule
  • the punishment unit 60 is configured to make corresponding punishment measures according to the corresponding fraud level.
  • The Voronoi algorithm outputs the first n points with the smallest anomaly factors; the earlier a point is output, the higher the probability that its corresponding data is fraudulent, so the fraud level of the fraud data is determined according to the output order.
  • The punitive measures generally include alarms, fines, and account bans.
  • The relevant data of an enterprise is extracted and then analyzed by the above method. If no fraud data is found, the enterprise is considered to be a reputable enterprise; if fraud data is present, the output fraud data is evaluated, and the more fraud data is output, the lower the enterprise's credibility.
  • The corresponding original data can be traced back from the fraud data, and the fraudulent behavior of the enterprise, such as the fraud amount and the fraud method, can then be analyzed. According to the fraud amount and/or fraud method, it is decided whether to raise an alarm, impose a fine, or ban the account.
  • For example, enterprise A needs to visit enterprise B for a business inspection and to sign a cooperation contract.
  • Through the above fraud-data identification method, enterprise A obtains the fraud data of enterprise B over a specified time period on the blockchain. If there is no fraud data, a closer mode of cooperation can be selected; if there is fraud data but only a little, for example a single item of fraud data within five years, a relatively close mode of cooperation can still be chosen; if there is a larger amount of fraud data, enterprise A needs to reconsider whether to establish a cooperative relationship with enterprise B at all.
  • The data fraud identification device in the embodiment of the present application solves for the first time the problem of identifying fraudulent data on an enterprise blockchain.
  • The Voronoi algorithm can single out the data that may be fraudulent, so that an enterprise can assess whether the people or enterprises it does business with may have engaged in fraud, choose an appropriate degree of cooperation accordingly, and thereby reduce the risk in enterprise-to-enterprise or individual-to-enterprise cooperation.
  • The computer device may be a server, and its internal structure may be as shown in FIG. 11.
  • The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • The internal memory provides an environment for the operation of the operating system and the computer readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store data such as the Voronoi algorithm model.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by a processor to implement the processes of the various method embodiments described above.
  • An embodiment of the present application further provides a non-volatile computer-readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the processes of the foregoing method embodiments.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application discloses a data fraud identification method, apparatus, computer device, and storage medium. The method comprises: acquiring data related to a specified business from a blockchain; performing feature extraction on the acquired data to obtain a plurality of feature data; extracting feature data unrelated to other feature data from the plurality of feature data to serve as unrelated feature data; and performing anomaly detection on the unrelated feature data by means of a Voronoi algorithm to obtain fraudulent data.

Description

Data fraud identification method, apparatus, computer device and storage medium
This application claims priority to the Chinese patent application No. 2018103447388, filed with the Chinese Patent Office on April 17, 2018 and entitled "Data fraud identification method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data fraud identification, and in particular to a data fraud identification method, apparatus, computer device and storage medium.
Background
Blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and is not subject to control by any single party. Data on a blockchain cannot be falsified after the fact, but fraudulent data created by malicious fake orders ("order brushing") can still exist. How to determine whether the data on a blockchain was generated by normal transactions or was "brushed" is an urgent problem to be solved.
Technical Problem
The main purpose of the present application is to provide a data fraud identification method, apparatus, computer device and storage medium that can effectively identify fraudulent data on a blockchain.
Technical Solution
The present application provides a data fraud identification method for data fraud identification on a blockchain, the method comprising:
acquiring data related to a specified enterprise on a blockchain;
performing feature extraction on the acquired data to obtain a plurality of feature data;
extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data; and
performing outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
The present application also provides a data fraud identification apparatus for data fraud identification on a blockchain, the apparatus comprising:
an obtaining unit, configured to acquire data related to a specified enterprise on a blockchain;
a feature extraction unit, configured to perform feature extraction on the acquired data to obtain a plurality of feature data;
an irrelevant analysis unit, configured to extract, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data; and
an abnormality identifying unit, configured to perform outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
The present application further provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, and the processor, when executing the computer readable instructions, implementing the steps of any of the methods described above.
The present application also provides a non-volatile computer-readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the steps of any of the methods described above.
Beneficial Effects
The data fraud identification method, apparatus, computer device and storage medium of the present application address for the first time the problem of identifying fraudulent data on an enterprise blockchain. Using the Voronoi algorithm, data that may be fraudulent can be singled out, so that an enterprise can assess whether the people or companies it does business with may be engaging in fraud and then choose an appropriate degree of cooperation.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application;
FIG. 2 is a Voronoi diagram according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data fraud identification method according to an embodiment of the present application;
FIG. 4 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram showing the structure of an abnormality identifying unit according to an embodiment of the present application;
FIG. 6 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application;
FIG. 7 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to another embodiment of the present application;
FIG. 8 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application;
FIG. 9 is a schematic block diagram showing the structure of an uncorrelated analysis unit according to an embodiment of the present application;
FIG. 10 is a schematic block diagram showing the structure of a data fraud identification apparatus according to an embodiment of the present application;
FIG. 11 is a schematic block diagram showing the structure of a computer device according to an embodiment of the present application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, an embodiment of the present application provides a data fraud identification method for data fraud identification on a blockchain, the method including the steps of: S1. Acquiring data related to a specified enterprise on a blockchain;
S2. Performing feature extraction on the acquired data to obtain a plurality of feature data;
S3. Extracting, from the plurality of feature data, the feature data that is not correlated with other feature data as irrelevant feature data;
S4. Performing outlier identification on the irrelevant feature data by using a Voronoi algorithm to obtain fraud data.
As described in step S1 above, the blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and is not subject to control by any single party. A blockchain is a distributed ledger database that manages a continuously growing set of transaction records that are organized into blocks and protected against tampering. A distributed ledger refers to a digital record of ownership that differs from traditional database technology in that no central administrator or central data store is required; the ledger can be replicated across the nodes of a peer-to-peer network, and each transaction is signed with a private key. A consensus mechanism is a mechanism by which blockchain or distributed ledger applications identify and verify a value or transaction without relying on a central authority, and it is the basis of all blockchain and distributed ledger applications. The specified enterprise refers to the enterprise to be queried, that is, the enterprise for which it is to be determined whether fraudulent data exists on the blockchain. The related data is generally all data related to the specified enterprise on the blockchain, such as the account, amount, date, time, currency, channel, merchant, product information, user IP, and device related to the specified enterprise. A specific way of obtaining the specified data includes entering keywords such as the enterprise name and business scope and then searching the blockchain to obtain all data related to the search terms.
As described in step S2 above, this is the feature extraction process. In one embodiment, the specific process includes: integrating the data and normalizing it into a data set; formatting the data in the data set and cleaning the sampled data; and then transforming the sampled data to obtain the required feature data.
In a specific embodiment, the feature extraction may use the ReliefF algorithm, which was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (the Relief algorithm is a feature weighting algorithm that assigns different weights to features according to the correlation between each feature and the class; features whose weights fall below a certain threshold are removed). Compared with the Relief algorithm, ReliefF can handle multi-class problems. The ReliefF algorithm is used to process regression problems where the target attribute is a continuous value. When handling multi-class problems, the ReliefF algorithm randomly draws a sample R from the training set, finds R's k nearest-neighbor samples of the same class (near hits H_j) and, for each class different from R's, k nearest-neighbor samples (near misses M_j(C)), and then updates the weight of each feature A as follows:
W(A) = W(A) - \sum_{j=1}^{k} diff(A, R, H_j)/(m \cdot k) + \sum_{C \neq class(R)} [ p(C)/(1 - p(class(R))) ] \cdot \sum_{j=1}^{k} diff(A, R, M_j(C))/(m \cdot k)
In the above formula, diff(A, R_1, R_2) represents the difference between sample R_1 and sample R_2 on feature A, M_j(C) represents the j-th nearest-neighbor sample in class C, and for a numeric feature diff is computed as follows:
diff(A, R_1, R_2) = |R_1[A] - R_2[A]| / (max(A) - min(A))
In another specific embodiment, the Relief algorithm described above is used for feature extraction. The Relief algorithm randomly selects a sample R from the training set D, then searches for the nearest-neighbor sample H among the samples of the same class as R (called the near hit) and the nearest-neighbor sample M among the samples of a different class (called the near miss), and then updates the weight of each feature according to the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature is helpful for distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is increased; conversely, if the distance between R and the near hit is greater than the distance between R and the near miss, the feature has a negative effect on distinguishing nearest neighbors of the same class from those of different classes, so the weight of that feature is decreased. The above process is repeated m times, and finally the average weight of each feature is obtained. The greater the weight of a feature, the stronger its ability to discriminate between classes; conversely, a smaller weight indicates a weaker discriminative ability. The running time of the Relief algorithm increases linearly with the number of sampling iterations m and the number of original features N, so it is very efficient.
As described in step S3 above, the goal is to find, among the feature data, the irrelevant feature data that is not correlated with the other data. Because the irrelevant feature data is uncorrelated with the other feature data, the original data it corresponds to may be fraud data.
As described in step S4 above, a Voronoi diagram (also called a Thiessen polygon or Dirichlet tessellation) consists of a set of contiguous polygons formed by the perpendicular bisectors of the lines connecting neighboring points. N distinct points in the plane partition the plane according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region. With the Voronoi algorithm, outliers can be further identified from the irrelevant feature data, and the identified outliers are treated as fraud data. The Voronoi algorithm has low complexity and is fast to compute.
In a specific embodiment, all data related to enterprise Y is extracted from the blockchain, such as enterprise Y's accounts, account login times, number of transactions, transaction amounts, channels, merchants, product information, user IPs, and so on. Feature extraction is then performed on the obtained data, and a correlation analysis is carried out on the extracted feature data; the feature data that is not correlated with the other feature data is the irrelevant feature data, and the original data corresponding to the irrelevant feature data may be fraud data. To determine how likely it is that the original data corresponding to the irrelevant feature data is fraud data, outlier identification is performed on the irrelevant feature data with the Voronoi algorithm. The Voronoi algorithm finds the outliers within the irrelevant feature data (according to the rules of the Voronoi algorithm, it computes the irrelevant feature data that differs from the other irrelevant feature data) and then outputs the outliers in ranked order, where the original data corresponding to the first outlier output is the most likely to be fraud data, with the likelihood decreasing in turn; in other embodiments, the ordering may instead be configured so that the original data corresponding to the first outlier output is the least likely to be fraud data, with the likelihood increasing in turn. In this embodiment, all outliers may be output, or only a specified number of outliers whose corresponding original data is most likely to be fraud data may be output.
Referring to FIG. 2, in this embodiment, step S4 of performing outlier identification on the irrelevant feature data with the Voronoi algorithm to obtain fraud data specifically includes:
a. constructing a Voronoi diagram of the point set S from the irrelevant feature data;
here, each item of irrelevant feature data is treated as one point, so that the point set S is generated;
b. calculating the V-anomaly factor of each point in the point set S and finding the V-neighbors of each point, specifically: b1. for a point pi in the point set S, determining its neighboring points from its Voronoi polygon V(pi), calculating the average distance from pi to each of its neighboring points, and using the reciprocal of this average distance to measure how anomalous pi is;
b2. for any point p in the point set S, the neighboring points of p determined by the edges of V(p) are called the V-neighbors of p, and the set of all V-neighbors of p is denoted V(p);
b3. the reciprocal of the average distance from all the V-neighbors of p to p is called the V-anomaly factor of p, denoted Vd(p):
Vd(p) = \frac{|V(p)|}{\sum_{q \in V(p)} d(p, q)}
where |V(p)| is the number of V-neighbors of p;
Vd(p) reflects the density of the points around p: the sparser the distribution of points around p, the larger the average distance from p to its V-neighbors, and hence the smaller Vd(p); in other words, a smaller V-anomaly factor indicates a more isolated and more anomalous point.
c. sorting the points by their V-anomaly factors in ascending order;
d. outputting the V-anomaly factor of each point together with the first n points having the smallest anomaly factors; the data corresponding to these first n points is judged to be the data with the highest risk of being fraud data. The sparser the distribution of points around a point p, the smaller its anomaly factor, which indicates a lower correlation with the other data, so the probability that its corresponding irrelevant feature data is an outlier is higher; therefore, the original data of the irrelevant feature data corresponding to the smallest V-anomaly factors is the most likely to be fraud data. Here, n is a preset value and is an integer greater than zero.
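A compact Python sketch of steps a through d is given below, using scipy's Voronoi construction; treating each item of irrelevant feature data as a low-dimensional numeric point (for example a 2-D projection), as well as the function and variable names, are assumptions made for illustration only:

```python
import numpy as np
from scipy.spatial import Voronoi

def v_anomaly_factors(points, n=5):
    """Rank the points of set S by V-anomaly factor Vd(p); the smallest factors are output first."""
    points = np.asarray(points, dtype=float)
    vor = Voronoi(points)
    neighbors = {i: set() for i in range(len(points))}
    # Two points whose Voronoi cells share a ridge are V-neighbors of each other.
    for i, j in vor.ridge_points:
        neighbors[i].add(j)
        neighbors[j].add(i)
    vd = np.empty(len(points))
    for i, nbrs in neighbors.items():
        dists = [np.linalg.norm(points[i] - points[j]) for j in nbrs]
        # Vd(p) = |V(p)| / sum of distances, i.e. the reciprocal of the mean neighbor distance.
        vd[i] = len(dists) / np.sum(dists) if dists else 0.0
    order = np.argsort(vd)            # ascending: smallest anomaly factor first
    return vd, order[:n]              # the first n points carry the highest fraud risk
```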
In this embodiment, step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S31. visualizing the plurality of feature data, and recording the feature data corresponding to discrete points in the visualization as the irrelevant feature data.
As described in S31 above, the visualization processing refers to converting the feature data into graphics or images displayed on a screen by means of computer graphics and image processing techniques. Because the feature data is visualized, a person can directly identify the discrete points in the graphic or image by eye and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
The step of visualizing the plurality of feature data includes: plotting the plurality of feature data as a scatter plot. In regression analysis, a scatter plot (scatter diagram) is the distribution of data points on a Cartesian coordinate plane; it is usually used to compare aggregated data across categories. The more data a scatter plot contains, the better the comparison. In this embodiment, the feature data is generally a matrix, in which case a scatter plot matrix can be used to draw the pairwise scatter plots of the variables simultaneously, so that the main correlations among multiple variables can be found quickly.
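As an illustration of the scatter plot matrix idea, the following Python sketch draws every pairwise scatter plot of a hypothetical feature matrix so that isolated (discrete) points can be spotted by eye; the column names and the random data are assumptions, not data from this application:

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Hypothetical feature matrix: rows are records, columns are extracted features.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 4)),
                        columns=["login_time", "tx_count", "tx_amount", "channel_score"])

# One figure containing all pairwise scatter plots; points far from the main
# cloud in several panels are candidates for irrelevant feature data.
scatter_matrix(features, figsize=(8, 8), diagonal="hist")
plt.show()
```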
In another embodiment, step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S32. performing correlation matrix analysis on the plurality of feature data, and extracting the irrelevant feature data that is not correlated with the other feature data.
The correlation matrix, also called the correlation coefficient matrix, is made up of the correlation coefficients between the columns of a matrix; that is, the element in row i, column j of the correlation matrix is the correlation coefficient between column i and column j of the original matrix. In this embodiment, a covariance matrix is generally used for the analysis. Covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating that they are positively correlated; if they tend to change in opposite directions, the covariance is negative, indicating that they are negatively correlated; and if the two variables are independent of each other, the covariance is 0, indicating that they are uncorrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
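A minimal sketch of the correlation matrix analysis described above follows; the numeric threshold and the rule that a feature is flagged when its absolute correlation with every other feature stays below that threshold are assumptions introduced here for illustration:

```python
import numpy as np

def uncorrelated_features(X, names, threshold=0.1):
    """Flag columns whose absolute correlation with all other columns stays below threshold."""
    corr = np.corrcoef(X, rowvar=False)          # correlation coefficient matrix of the columns
    flagged = []
    for i, name in enumerate(names):
        others = np.delete(np.abs(corr[i]), i)   # drop the trivial self-correlation of 1
        if np.all(others < threshold):
            flagged.append(name)                 # candidate irrelevant feature data
    return corr, flagged
```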
In yet another specific embodiment, step S2 of performing feature extraction on the acquired data to obtain the feature data includes:
S21. classifying the acquired data according to preset requirements;
S22. performing feature extraction on each class of data separately.
As described in steps S21 and S22 above, the preset requirement is the classification criterion; in general, data that may be correlated is grouped into one class. For example, data such as account, login time, number of transactions, and transaction amount are grouped together, because the correlation among these items is strong; the correlation is strong because the account is used to log in to the corresponding system, the system records the login time, and it also records the number of transactions and the amount of each transaction or the overall transaction amount, so the items are associated with one another. In this embodiment, the acquired data is first classified, and feature extraction is then performed on each class of data separately; the specific feature extraction method may be, for example, the Relief algorithm described above or the ReliefF algorithm described above. Extracting feature data class by class has two benefits: first, the features of each class of feature data are relatively distinct, which makes extraction easier; second, it can improve the accuracy of the later identification of fraud data. For example, the subsequent step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S301. extracting the irrelevant feature data from the plurality of feature data corresponding to each class of data;
S302. mixing the irrelevant feature data corresponding to the classes of data, performing a correlation analysis, and recording the irrelevant feature data that has no correlation as the final irrelevant feature data.
As described in steps S301 and S302 above, this is carried out on the basis of the class-by-class feature extraction described above. For example, visualization processing or correlation matrix analysis is first performed on the class containing account, login time, number of transactions, and transaction amount to obtain a first group of irrelevant feature data; similarly, the features extracted from the class containing channel, merchant, product information, user IP, and so on are visualized or analyzed with a correlation matrix to obtain a second group of irrelevant feature data. A correlation analysis is then performed between the first group and the second group of irrelevant feature data; for example, if feature A in the first group and feature B in the second group are correlated with each other, A and B are removed, and the mutually uncorrelated irrelevant feature data is kept as the final irrelevant feature data. Because the first and second groups of irrelevant feature data are already the outliers within their respective classes of data and may therefore be fraud data, a correlation analysis is performed among these possibly fraudulent items: items that are correlated have a higher probability of being normal data, while items that remain uncorrelated have a higher probability of being fraud data. Building the point set S from the final irrelevant feature data for fraud identification can improve the accuracy of fraud identification.
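A sketch, under assumptions, of steps S301 and S302: per-class candidate irrelevant features are mixed, and candidates that turn out to be correlated across classes are dropped as probably normal. The threshold, the data layout (each group must contain the same records as rows), and all names are hypothetical:

```python
import numpy as np

def final_irrelevant(groups, names, threshold=0.3):
    """Mix per-class outlier candidates and keep only those uncorrelated across classes.

    groups: list of 2D arrays, one per data class, columns = candidate irrelevant features;
    names:  matching lists of column names."""
    cols = np.hstack(groups)
    labels = [n for grp in names for n in grp]
    cls = np.concatenate([np.full(g.shape[1], i) for i, g in enumerate(groups)])
    corr = np.corrcoef(cols, rowvar=False)
    drop = set()
    for a in range(cols.shape[1]):
        for b in range(a + 1, cols.shape[1]):
            # A cross-class correlation suggests both candidates are probably normal data.
            if cls[a] != cls[b] and abs(corr[a, b]) >= threshold:
                drop.update((a, b))
    return [labels[i] for i in range(cols.shape[1]) if i not in drop]
```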
In another embodiment, step S3 of extracting, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data includes:
S303. visualizing the plurality of feature data;
S304. extracting the feature data corresponding to the discrete points in the visualization, performing correlation matrix analysis on the feature data corresponding to the discrete points, extracting the unassociated feature data that has no association among the feature data corresponding to the discrete points, and recording the unassociated feature data as the irrelevant feature data.
As described in steps S303 and S304 above, the feature data of each class of data is first visualized, the discrete points among the various kinds of feature data are selected, and the feature data corresponding to each discrete point is found; correlation matrix analysis is then used to determine whether the feature data corresponding to the discrete points is associated, and the unassociated feature data is recorded as the irrelevant feature data. In other words, the possibly irrelevant feature data is first found through the visualization process and is then processed once more by correlation matrix analysis to obtain the final irrelevant feature data, thereby improving the accuracy of the subsequent identification of fraud data.
Referring to FIG. 3, in this embodiment, after step S4 of performing outlier identification on the irrelevant feature data with the Voronoi algorithm to obtain fraud data, the method includes:
S5. determining the fraud level of the fraud data according to preset rules;
S6. taking corresponding penalty measures according to the corresponding fraud level.
As described in steps S5 and S6 above, the Voronoi algorithm outputs the first n points with the smallest anomaly factors, and the data corresponding to the foremost point has the highest probability of being fraud data, so the fraud level of the fraud data is determined according to the order of the output. The penalty measures generally include raising an alarm, imposing a fine, suspending the account, and the like. For example, the relevant data of an enterprise is extracted and then analyzed with the method described above; if no fraud data exists, the enterprise is considered reputable, while if fraud data exists, the number of items of fraud data output is assessed: the more fraud data output, the lower the enterprise's credibility. The data output by the Voronoi algorithm can also be traced back to the corresponding original data so as to analyze the enterprise's fraudulent behavior, for example the fraud amount and the type of fraud, and whether to raise an alarm, suspend the account, and so on is then decided according to the fraud amount and/or the fraudulent behavior.
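One possible, purely illustrative mapping from the ranked Voronoi output to a fraud level and a penalty is sketched below; the level bands and the penalty wording are assumptions, while the application itself only names alarms, fines, and account suspension as examples of penalty measures:

```python
def fraud_level(rank, n):
    """Map the output order (rank 0 = first output = highest risk) to an assumed fraud level."""
    position = rank / max(n - 1, 1)
    if position < 0.2:
        return "high"
    if position < 0.6:
        return "medium"
    return "low"

def penalty(level):
    """Hypothetical penalty rules keyed on the fraud level."""
    return {"high": "suspend account and raise an alarm",
            "medium": "impose a fine and review the account",
            "low": "flag the account for monitoring"}[level]

# Example: the first of ten output outliers gets the highest fraud level.
print(fraud_level(0, 10), penalty(fraud_level(0, 10)))
```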
In one specific implementation, A needs to visit enterprise B for a business inspection and to sign a cooperation contract. Before going to enterprise B, A first uses the fraud data identification method described above to obtain enterprise B's fraud data on the blockchain within a specified time period. If there is no fraud data, A can choose a closer form of cooperation; if fraud data exists but is scarce, for example a single item of fraud data within five years, A can choose an ordinary degree of cooperation; and if there is a large amount of fraud data, A needs to consider whether to establish a cooperative relationship with enterprise B at all.
The data fraud identification method of the embodiments of the present application is the first to address the identification of fraudulent data on an enterprise blockchain. By using the Voronoi algorithm, it can single out the data that may be fraud data, so that an enterprise can learn whether the other people or enterprises it does business with may be behaving fraudulently, and can then choose an appropriate closeness of cooperation, reducing the risk of being defrauded in cooperation between enterprises or between individuals and enterprises.
Referring to FIG. 4, an embodiment of the present application further provides a data fraud identification apparatus for data fraud identification on a blockchain, the apparatus including:
an obtaining unit 10, configured to acquire data related to a specified enterprise on a blockchain;
a feature extraction unit 20, configured to perform feature extraction on the acquired data to obtain a plurality of feature data;
an uncorrelated analysis unit 30, configured to extract, from the plurality of feature data, the feature data that is not correlated with the other feature data as the irrelevant feature data;
an abnormality identification unit 40, configured to perform outlier identification on the irrelevant feature data by means of the Voronoi algorithm to obtain fraud data.
In the obtaining unit 10, the blockchain is a decentralized, trust-free new data architecture that is jointly owned, managed, and supervised by all nodes in the network and does not accept control by any single party. A blockchain is a distributed ledger database that manages continuously growing transaction records that are organized into blocks in order and protected against tampering. A distributed ledger is a record of digital ownership that differs from traditional database technology (no central administrator or central data store is required); such a ledger can be replicated among different nodes of a peer-to-peer network, and each transaction is signed with a private key. A consensus mechanism is a mechanism used by blockchain or distributed ledger applications to identify and verify a value or transaction without relying on a central authority, and it is the basis of all blockchain and distributed ledger applications. The specified enterprise refers to the enterprise to be queried, that is, the enterprise for which the existence of fraud data on the blockchain is to be checked. The related data is generally all data on the blockchain related to the specified enterprise, such as the accounts, amounts, dates, times, currencies, channels, merchants, product information, user IPs, devices, and so on associated with that enterprise. A specific method of acquiring the specified data includes: entering keywords such as the enterprise name and the enterprise's scope of operation, and then searching the blockchain to obtain all data related to the search terms.
The feature extraction unit 20 is the unit that completes feature extraction. In one embodiment, its specific process includes: integrating the data and normalizing it into a single data set; formatting the data in the data set and cleaning the sampled data; and then converting the sampled data to obtain the required feature data.
In a specific embodiment, the feature extraction of the feature extraction unit 20 uses the ReliefF algorithm. ReliefF was proposed by Kononenko in 1994 as an improvement on the Relief algorithm (Relief is a feature weighting algorithm that assigns each feature a weight according to its correlation with the class; features whose weights fall below a certain threshold are removed), and compared with Relief it can handle multi-class problems. The ReliefF algorithm is also used to handle regression problems in which the target attribute is a continuous value. When handling a multi-class problem, ReliefF repeatedly draws a random sample R from the training set, finds the k nearest neighbor samples of R within the same class (near Hits), finds k nearest neighbor samples in each class different from that of R (near Misses), and then updates the weight of each feature; the specific process has already been described in the method embodiments above and is not repeated here.
In another specific embodiment, the feature extraction unit 20 performs feature extraction with the Relief algorithm described above. Relief randomly selects a sample R from the training set D, then finds the nearest neighbor sample H among samples of the same class (called the Near Hit) and the nearest neighbor sample M among samples of a different class (called the Near Miss), and updates the weight of each feature according to the following rule: if the distance between R and the Near Hit on a feature is smaller than the distance between R and the Near Miss on that feature, the feature helps distinguish nearest neighbors of the same and different classes, and its weight is increased; conversely, if the distance between R and the Near Hit on a feature is larger than the distance between R and the Near Miss, the feature has a negative effect on that distinction, and its weight is decreased. The above process is repeated m times, and the average weight of each feature is finally obtained. The larger a feature's weight, the stronger its discriminative ability; conversely, the smaller the weight, the weaker that ability. The running time of Relief increases linearly with the number of sampling iterations m and the number of original features N, so it is very efficient.
The uncorrelated analysis unit 30 is the unit that finds, among the feature data, the irrelevant feature data that is not correlated with the other data. Because the irrelevant feature data is uncorrelated with the other feature data, the original data it corresponds to may be fraud data.
In the abnormality identification unit 40, the Voronoi diagram (also called a Thiessen polygon or Dirichlet tessellation) consists of a set of contiguous polygons formed by the perpendicular bisectors of the lines connecting neighboring points. N distinct points in the plane partition the plane according to the nearest-neighbor principle, and each point is associated with its nearest-neighbor region. With the Voronoi algorithm, outliers can be further identified from the irrelevant feature data, and the identified outliers are treated as fraud data. The Voronoi algorithm has low complexity and is fast to compute.
In a specific embodiment, the obtaining unit 10 extracts all data related to enterprise Y from the blockchain, such as enterprise Y's accounts, account login times, number of transactions, transaction amounts, channels, merchants, product information, user IPs, and so on. The feature extraction unit 20 then performs feature extraction on the obtained data, after which the uncorrelated analysis unit 30 performs a correlation analysis on the extracted feature data; the feature data that is not correlated with the other feature data is the irrelevant feature data, and the original data corresponding to the irrelevant feature data may be fraud data. To determine how likely it is that the original data corresponding to the irrelevant feature data is fraud data, the abnormality identification unit 40 performs outlier identification on the irrelevant feature data with the Voronoi algorithm (according to the rules of the Voronoi algorithm, it computes the irrelevant feature data that differs from the other irrelevant feature data). The Voronoi algorithm finds the outliers within the irrelevant feature data and then outputs them in ranked order, where the original data corresponding to the first outlier output is the most likely to be fraud data, with the likelihood decreasing in turn; in other embodiments, the ordering may instead be configured so that the original data corresponding to the first outlier output is the least likely to be fraud data, with the likelihood increasing in turn. In this embodiment, all outliers may be output, or only a specified number of outliers whose corresponding original data is most likely to be fraud data may be output.
Referring to FIG. 5 and FIG. 2, in this embodiment, the abnormality identification unit 40 includes:
a diagram module 41, configured to construct a Voronoi diagram of the point set S from the irrelevant feature data, where each item of irrelevant feature data is treated as one point, so that the point set S is generated;
a calculation module 42, configured to calculate the V-anomaly factor of each point in the point set S and find the V-neighbors of each point. The specific execution process of the calculation module 42 includes: for a point pi in the point set S, determining its neighboring points from its Voronoi polygon V(pi), calculating the average distance from pi to each of its neighboring points, and using the reciprocal of this average distance to measure how anomalous pi is; for any point p in the point set S, the neighboring points of p determined by the edges of V(p) are called the V-neighbors of p, and the set of all V-neighbors of p is denoted V(p); the reciprocal of the average distance from all the V-neighbors of p to p is called the V-anomaly factor of p, denoted Vd(p):
Vd(p) = \frac{|V(p)|}{\sum_{q \in V(p)} d(p, q)}
where |V(p)| is the number of V-neighbors of p;
Vd(p) reflects the density of the points around p: the sparser the distribution of points around p, the larger the average distance from p to its V-neighbors, and hence the smaller Vd(p); in other words, a smaller V-anomaly factor indicates a more isolated and more anomalous point.
an arrangement module 43, configured to sort the points by their V-anomaly factors in ascending order;
an output module 44, configured to output the V-anomaly factor of each point together with the first n points having the smallest anomaly factors; the data corresponding to these first n points is judged to be the data with the highest risk of being fraud data. The sparser the distribution of points around a point p, the smaller its anomaly factor, which indicates a lower correlation with the other data, so the probability that its corresponding irrelevant feature data is an outlier is higher; therefore, the original data of the irrelevant feature data corresponding to the smallest V-anomaly factors is the most likely to be fraud data. Here, n is a preset value and is an integer greater than zero.
Referring to FIG. 6, in this embodiment, the uncorrelated analysis unit 30 includes:
a visual analysis module 31, configured to visualize the plurality of feature data and record the feature data corresponding to discrete points in the visualization as the irrelevant feature data.
In the visual analysis module 31, the visualization processing refers to converting the feature data into graphics or images displayed on a screen by means of computer graphics and image processing techniques. Because the feature data is visualized, a person can directly identify the discrete points in the graphic or image by eye and select them, and the computer device then records the feature data corresponding to the selected discrete points as the irrelevant feature data.
The visual analysis module 31 includes a scatter plot creation sub-module 311, configured to plot the plurality of feature data as a scatter plot. In regression analysis, a scatter plot (scatter diagram) is the distribution of data points on a Cartesian coordinate plane; it is usually used to compare aggregated data across categories. The more data a scatter plot contains, the better the comparison. In this embodiment, the feature data is generally a matrix, in which case a scatter plot matrix can be used to draw the pairwise scatter plots of the variables simultaneously, so that the main correlations among multiple variables can be found quickly.
Referring to FIG. 7, in another embodiment, the uncorrelated analysis unit 30 includes:
a correlation matrix analysis module 32, configured to perform correlation matrix analysis on the plurality of feature data and extract the irrelevant feature data that is not correlated with the other feature data.
In the correlation matrix analysis module 32, the correlation matrix, also called the correlation coefficient matrix, is made up of the correlation coefficients between the columns of a matrix; that is, the element in row i, column j of the correlation matrix is the correlation coefficient between column i and column j of the original matrix. In this embodiment, a covariance matrix is generally used for the analysis. Covariance measures the joint variability of two variables: if the two variables tend to change in the same direction, the covariance is positive, indicating that they are positively correlated; if they tend to change in opposite directions, the covariance is negative, indicating that they are negatively correlated; and if the two variables are independent of each other, the covariance is 0, indicating that they are uncorrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
Referring to FIG. 8, in yet another specific embodiment, the feature extraction unit 20 includes:
a classification module 21, configured to classify the acquired data according to preset requirements;
an extraction module 22, configured to perform feature extraction on each class of data separately.
In the classification module 21 and the extraction module 22, the preset requirement is the classification criterion; in general, data that may be correlated is grouped into one class. For example, data such as account, login time, number of transactions, and transaction amount are grouped together, because the correlation among these items is strong; the correlation is strong because the account is used to log in to the corresponding system, the system records the login time, and it also records the number of transactions and the amount of each transaction or the overall transaction amount, so the items are associated with one another. In this embodiment, the acquired data is first classified, and feature extraction is then performed on each class of data separately; the specific feature extraction method may be, for example, the Relief algorithm described above or the ReliefF algorithm described above. Extracting feature data class by class has two benefits: first, the features of each class of feature data are relatively distinct, which makes extraction easier; second, it can improve the accuracy of the later identification of fraud data. For example, the subsequent uncorrelated analysis unit 30 includes:
a classification analysis module 301, configured to extract the irrelevant feature data from the plurality of feature data corresponding to each class of data;
a mixed analysis module 302, configured to mix the irrelevant feature data corresponding to the classes of data, perform a correlation analysis, and record the irrelevant feature data that has no correlation as the final irrelevant feature data.
The classification analysis module 301 and the mixed analysis module 302 operate on the basis of the class-by-class feature extraction described above. For example, visualization processing or correlation matrix analysis is first performed on the class containing account, login time, number of transactions, and transaction amount to obtain a first group of irrelevant feature data; similarly, the features extracted from the class containing channel, merchant, product information, user IP, and so on are visualized or analyzed with a correlation matrix to obtain a second group of irrelevant feature data. A correlation analysis is then performed between the first group and the second group of irrelevant feature data; for example, if feature A in the first group and feature B in the second group are correlated with each other, A and B are removed, and the mutually uncorrelated irrelevant feature data is kept as the final irrelevant feature data. Because the first and second groups of irrelevant feature data are already the outliers within their respective classes of data and may therefore be fraud data, a correlation analysis is performed among these possibly fraudulent items: items that are correlated have a higher probability of being normal data, while items that remain uncorrelated have a higher probability of being fraud data. Building the point set S from the final irrelevant feature data for fraud identification can improve the accuracy of fraud identification.
Referring to FIG. 9, in another embodiment, the uncorrelated analysis unit 30 includes:
a visualization module 303, configured to visualize the plurality of feature data;
a matrix analysis module 304, configured to extract the feature data corresponding to the discrete points in the visualization, perform correlation matrix analysis on the feature data corresponding to the discrete points, extract the unassociated feature data that has no association among the feature data corresponding to the discrete points, and record the unassociated feature data as the irrelevant feature data.
In the visualization module 303 and the matrix analysis module 304, the feature data of each class of data is first visualized, the discrete points among the various kinds of feature data are selected, and the feature data corresponding to each discrete point is found; correlation matrix analysis is then used to determine whether the feature data corresponding to the discrete points is associated, and the unassociated feature data is recorded as the irrelevant feature data. In other words, the possibly irrelevant feature data is first found through the visualization process and is then processed once more by correlation matrix analysis to obtain the final irrelevant feature data, thereby improving the accuracy of the subsequent identification of fraud data.
Referring to FIG. 10, in this embodiment, the data fraud identification apparatus further includes:
a fraud level determining unit 50, configured to determine the fraud level of the fraud data according to preset rules;
a penalty unit 60, configured to take corresponding penalty measures according to the corresponding fraud level.
In the fraud level determining unit 50 and the penalty unit 60, the Voronoi algorithm outputs the first n points with the smallest anomaly factors, and the data corresponding to the foremost point has the highest probability of being fraud data, so the fraud level of the fraud data is determined according to the order of the output. The penalty measures generally include raising an alarm, imposing a fine, suspending the account, and the like. For example, the relevant data of an enterprise is extracted and then analyzed with the method described above; if no fraud data exists, the enterprise is considered reputable, while if fraud data exists, the number of items of fraud data output is assessed: the more fraud data output, the lower the enterprise's credibility. The data output by the Voronoi algorithm can also be traced back to the corresponding original data so as to analyze the enterprise's fraudulent behavior, for example the fraud amount and the type of fraud, and whether to raise an alarm, suspend the account, and so on is then decided according to the fraud amount and/or the fraudulent behavior.
In one specific implementation, A needs to visit enterprise B for a business inspection and to sign a cooperation contract. Before going to enterprise B, A first uses the fraud data identification method described above to obtain enterprise B's fraud data on the blockchain within a specified time period. If there is no fraud data, A can choose a closer form of cooperation; if fraud data exists but is scarce, for example a single item of fraud data within five years, A can choose an ordinary degree of cooperation; and if there is a large amount of fraud data, A needs to consider whether to establish a cooperative relationship with enterprise B at all.
The data fraud identification apparatus of the embodiments of the present application is the first to address the identification of fraudulent data on an enterprise blockchain. By using the Voronoi algorithm, it can single out the data that may be fraud data, so that an enterprise can learn whether the other people or enterprises it does business with may be behaving fraudulently, and can then choose an appropriate closeness of cooperation, reducing the risk of being defrauded in cooperation between enterprises or between individuals and enterprises.
Referring to FIG. 11, an embodiment of the present invention further provides a computer device, which may be a server whose internal structure may be as shown in FIG. 11. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device is used to store data such as the Voronoi algorithm model. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by the processor, the flows of the above method embodiments are implemented.
An embodiment of the present invention further provides a computer non-volatile readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the flows of the above method embodiments are implemented.
The above descriptions are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A data fraud identification method for data fraud identification on a blockchain, the method comprising:
    acquiring data related to a specified enterprise on a blockchain;
    performing feature extraction on the acquired data to obtain a plurality of feature data;
    extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data;
    performing outlier identification on the irrelevant feature data by means of a Voronoi algorithm to obtain fraud data.
  2. The data fraud identification method according to claim 1, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    visualizing the plurality of feature data, and recording feature data corresponding to discrete points in the visualization as the irrelevant feature data.
  3. The data fraud identification method according to claim 2, wherein the step of visualizing the plurality of feature data comprises:
    plotting the plurality of feature data as a scatter plot.
  4. The data fraud identification method according to claim 1, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    performing correlation matrix analysis on the plurality of feature data, and extracting the irrelevant feature data that is not correlated with other feature data.
  5. The data fraud identification method according to claim 1, wherein the step of performing feature extraction on the acquired data to obtain feature data comprises:
    classifying the acquired data according to preset requirements;
    performing feature extraction on each class of data separately.
  6. The data fraud identification method according to claim 5, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    extracting irrelevant feature data from the plurality of feature data corresponding to each class of data;
    mixing the irrelevant feature data corresponding to the classes of data, performing a correlation analysis, and recording the irrelevant feature data having no correlation as final irrelevant feature data.
  7. The data fraud identification method according to claim 1, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data comprises:
    visualizing the plurality of feature data;
    extracting feature data corresponding to discrete points in the visualization, performing correlation matrix analysis on the feature data corresponding to the discrete points, extracting unassociated feature data having no association among the feature data corresponding to the discrete points, and recording the unassociated feature data as the irrelevant feature data.
  8. A data fraud identification apparatus for data fraud identification on a blockchain, the apparatus comprising:
    an obtaining unit, configured to acquire data related to a specified enterprise on a blockchain;
    a feature extraction unit, configured to perform feature extraction on the acquired data to obtain a plurality of feature data;
    an uncorrelated analysis unit, configured to extract, from the plurality of feature data, feature data that is not correlated with other feature data as irrelevant feature data;
    an abnormality identification unit, configured to perform outlier identification on the irrelevant feature data by means of a Voronoi algorithm to obtain fraud data.
  9. The data fraud identification apparatus according to claim 8, wherein the uncorrelated analysis unit comprises:
    a visual analysis module, configured to visualize the plurality of feature data and record feature data corresponding to discrete points in the visualization as the irrelevant feature data.
  10. The data fraud identification apparatus according to claim 9, wherein the visual analysis module comprises:
    a scatter plot creation sub-module 311, configured to plot the plurality of feature data as a scatter plot.
  11. The data fraud identification apparatus according to claim 8, wherein the uncorrelated analysis unit comprises:
    a correlation matrix analysis module, configured to perform correlation matrix analysis on the plurality of feature data and extract the irrelevant feature data that is not correlated with other feature data.
  12. The data fraud identification apparatus according to claim 8, wherein the feature extraction unit comprises:
    a classification module, configured to classify the acquired data according to preset requirements;
    an extraction module, configured to perform feature extraction on each class of data separately.
  13. The data fraud identification apparatus according to claim 12, wherein the uncorrelated analysis unit comprises:
    a classification analysis module, configured to extract irrelevant feature data from the plurality of feature data corresponding to each class of data;
    a mixed analysis module, configured to mix the irrelevant feature data corresponding to the classes of data, perform a correlation analysis, and record the irrelevant feature data having no correlation as final irrelevant feature data.
  14. The data fraud identification apparatus according to claim 8, wherein the uncorrelated analysis unit comprises:
    a visualization module, configured to visualize the plurality of feature data;
    a matrix analysis module, configured to extract feature data corresponding to discrete points in the visualization, perform correlation matrix analysis on the feature data corresponding to the discrete points, extract unassociated feature data having no association among the feature data corresponding to the discrete points, and record the unassociated feature data as the irrelevant feature data.
  15. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现数据欺诈识别方法,用于区块链上的数据欺诈识别,所述方法,包括:A computer device comprising a memory and a processor, the memory storing computer readable instructions, wherein the processor implements a data fraud identification method when the computer readable instructions are executed, for use on a blockchain Data fraud identification, the method comprising:
    在区块链上获取与指定企业相关的数据;Obtaining data related to the designated enterprise on the blockchain;
    将获取的数据进行特征提取,以得到多个特征数据;Extracting the acquired data to obtain a plurality of feature data;
    在所述多个特征数据中提取出与其它特征数据不相关的特征数据作为不相关特征数据;Extracting feature data not related to other feature data as unrelated feature data in the plurality of feature data;
    通过Voronoi算法对所述不相关特征数据进行异常值识别,得出欺诈数据。The outlier data is identified by the Voronoi algorithm to obtain fraud data.
  16. 根据权利要求15所述的计算机设备,其特征在于,所述在所述多个特征数据中提取出与其它特征数据不相关的特征数据作为不相关特征数据的步骤,包括:The computer device according to claim 15, wherein the step of extracting feature data not related to other feature data as irrelevant feature data in the plurality of feature data comprises:
    将所述多个特征数据可视化处理,将可视化中的离散点对应的特征数据记为所述不相关特征数据。The plurality of feature data are visualized, and the feature data corresponding to the discrete points in the visualization is recorded as the irrelevant feature data.
  17. 根据权利要求16所述的计算机设备,其特征在于,所述将所述多个特征数据可视化处理的步骤,包括:The computer device according to claim 16, wherein the step of visualizing the plurality of feature data comprises:
    将所述多个特征数据制作成散点图。The plurality of feature data is made into a scatter plot.
  18. The computer device according to claim 15, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with the other feature data as irrelevant feature data comprises:
    performing correlation matrix analysis on the plurality of feature data to extract the irrelevant feature data that is not correlated with the other feature data.
  19. A computer non-volatile readable storage medium having computer readable instructions stored thereon, wherein the computer readable instructions, when executed by a processor, implement a data fraud identification method for data fraud identification on a blockchain, the method comprising:
    acquiring data related to a specified enterprise on the blockchain;
    performing feature extraction on the acquired data to obtain a plurality of feature data;
    extracting, from the plurality of feature data, feature data that is not correlated with the other feature data as irrelevant feature data;
    performing outlier identification on the irrelevant feature data by means of the Voronoi algorithm to obtain fraud data.
  20. The computer non-volatile readable storage medium according to claim 19, wherein the step of extracting, from the plurality of feature data, feature data that is not correlated with the other feature data as irrelevant feature data comprises:
    visualizing the plurality of feature data, and recording the feature data corresponding to discrete points in the visualization as the irrelevant feature data.
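The claims above describe several computational steps without fixing any concrete implementation. The sketches that follow illustrate, in Python, one possible reading of each step; they are illustrative only, and every library, column name, and threshold they use is an assumption rather than part of the disclosure.

Claim 12 classifies the acquired data according to a preset requirement and then extracts features for each category separately. A minimal sketch, assuming hypothetical record columns tx_type, account, and amount:

```python
import pandas as pd

def classify_and_extract(records: pd.DataFrame, category_col: str = "tx_type") -> dict:
    """Classify acquired records by a preset category column, then extract
    simple per-account features within each category.

    The category column and the chosen features (transaction count, amount
    mean/std) are illustrative assumptions, not taken from the application.
    """
    features_by_category = {}
    for category, group in records.groupby(category_col):
        features = group.groupby("account").agg(
            tx_count=("amount", "size"),
            amount_mean=("amount", "mean"),
            amount_std=("amount", "std"),
        )
        features_by_category[category] = features.fillna(0.0)
    return features_by_category
```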
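Claims 14, 16 and 17 visualize the plurality of feature data as a scatter plot and treat the discrete (isolated) points as the irrelevant feature data. A hedged sketch of one way to do this, using a simple z-score rule to decide which plotted points count as discrete (the threshold and the use of pandas/matplotlib are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

def discrete_points(features: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Plot the first two numeric feature columns and flag points that sit
    far from the bulk of the data as 'discrete'."""
    z = (features - features.mean()) / features.std(ddof=0)
    is_discrete = (z.abs() > z_threshold).any(axis=1)

    x_col, y_col = features.columns[:2]          # assumes >= 2 numeric columns
    plt.scatter(features.loc[~is_discrete, x_col], features.loc[~is_discrete, y_col],
                c="steelblue", label="regular")
    plt.scatter(features.loc[is_discrete, x_col], features.loc[is_discrete, y_col],
                c="crimson", label="discrete")
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.legend()
    plt.show()

    return features[is_discrete]                 # feature data of the discrete points
```

In the device of claim 14 the matrix analysis module would then run a correlation matrix over the rows returned here.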
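Claims 13 and 18 keep only the feature data that shows no correlation with the other feature data. One possible implementation uses a Pearson correlation matrix and an assumed cut-off of 0.3 (the threshold is an assumption, not stated by the applicant):

```python
import pandas as pd

def irrelevant_features(features: pd.DataFrame, max_corr: float = 0.3) -> list:
    """Return the columns whose absolute correlation with every other column
    stays below max_corr, i.e. candidate 'irrelevant feature data'."""
    corr = features.corr().abs()                 # Pearson correlation matrix
    uncorrelated = []
    for col in corr.columns:
        others = corr[col].drop(labels=[col])    # ignore self-correlation (always 1.0)
        if (others < max_corr).all():
            uncorrelated.append(col)
    return uncorrelated
```

For the hybrid analysis of claim 13 the same routine could be applied twice: once within each data category, and once more on the mixture of the per-category results.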
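Claims 15 and 19 perform outlier identification on the irrelevant feature data by the Voronoi algorithm. The claims do not spell out the variant, so the sketch below follows the common Voronoi-neighbour approach described in the cited Qu Jilin et al. paper: each point is scored by its mean distance to the points with which it shares a Voronoi ridge, and high-scoring points are reported as candidate fraud data. The scoring rule and the mean + k·std cut-off are assumptions.

```python
import numpy as np
from scipy.spatial import Voronoi

def voronoi_outliers(points: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Return indices of candidate outliers in `points` (shape n x d, d >= 2,
    with enough points for a Voronoi diagram to exist)."""
    vor = Voronoi(points)
    neighbours = {i: set() for i in range(len(points))}
    for p, q in vor.ridge_points:                # each ridge separates two input points
        neighbours[p].add(q)
        neighbours[q].add(p)

    scores = np.zeros(len(points))
    for i, nbrs in neighbours.items():
        if nbrs:
            dists = np.linalg.norm(points[list(nbrs)] - points[i], axis=1)
            scores[i] = dists.mean()             # mean distance to Voronoi neighbours

    cutoff = scores.mean() + k * scores.std()
    return np.where(scores > cutoff)[0]
```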
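Put together, and reusing the helpers sketched above, the claimed method could be exercised roughly as follows; the random DataFrame merely stands in for feature data extracted from blockchain records of the specified enterprise, since no concrete schema is fixed by the application:

```python
import numpy as np
import pandas as pd

# Placeholder for feature data obtained from the blockchain.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(200, 6)),
                  columns=[f"f{i}" for i in range(6)])

irrelevant_cols = irrelevant_features(df)              # correlation-matrix step
if len(irrelevant_cols) >= 2:                          # Voronoi needs >= 2-D points
    candidates = voronoi_outliers(df[irrelevant_cols].to_numpy())
    print("candidate fraud records:", candidates)
```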
PCT/CN2018/095389 2018-04-17 2018-07-12 Data fraud identification method, apparatus, computer device, and storage medium WO2019200739A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810344738.8A CN108665270A (en) 2018-04-17 2018-04-17 Data fraud identification method, apparatus, computer device and storage medium
CN201810344738.8 2018-04-17

Publications (1)

Publication Number Publication Date
WO2019200739A1 true WO2019200739A1 (en) 2019-10-24

Family

ID=63783647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095389 WO2019200739A1 (en) 2018-04-17 2018-07-12 Data fraud identification method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108665270A (en)
WO (1) WO2019200739A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109636577A (en) * 2018-10-25 2019-04-16 深圳壹账通智能科技有限公司 IP address analysis method, device, equipment and computer readable storage medium
CN109697670B (en) * 2018-12-29 2021-06-04 杭州趣链科技有限公司 Public link information shielding method without influence on credibility
CN111598580A (en) * 2020-04-26 2020-08-28 杭州云象网络技术有限公司 XGboost algorithm-based block chain product detection method, system and device
CN111667267B (en) * 2020-05-29 2023-04-18 中国工商银行股份有限公司 Block chain transaction risk identification method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080175226A1 (en) * 2007-01-24 2008-07-24 Secure Computing Corporation Reputation Based Connection Throttling
CN104794192A (en) * 2015-04-17 2015-07-22 南京大学 Multi-level anomaly detection method based on exponential smoothing and integrated learning model
CN105976242A (en) * 2016-04-21 2016-09-28 中国农业银行股份有限公司 Transaction fraud detection method and system based on real-time streaming data analysis
CN107194803A (en) * 2017-05-19 2017-09-22 南京工业大学 A kind of P2P nets borrow the device of borrower's assessing credit risks
CN107785058A (en) * 2017-07-24 2018-03-09 平安科技(深圳)有限公司 Anti- fraud recognition methods, storage medium and the server for carrying safety brain

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QU JILIN, ET AL.: "Outlier Detection Algorithm Based on Voronoi Diagram", COMPUTER ENGINEERING, vol. 33, no. 23, 31 December 2007 (2007-12-31) *
WANG YAN: "Application of Blockchain Technology in Financial Industry and Suggestions for Its Development", HAINAN FINANCE, vol. 12, 31 December 2016 (2016-12-31), ISSN: 1003-9031 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111586052A (en) * 2020-05-09 2020-08-25 江苏大学 Multi-level-based crowd sourcing contract abnormal transaction identification method and identification system

Also Published As

Publication number Publication date
CN108665270A (en) 2018-10-16

Similar Documents

Publication Publication Date Title
Wang et al. Linkage based face clustering via graph convolution network
WO2019200739A1 (en) Data fraud identification method, apparatus, computer device, and storage medium
Bologa et al. Big data and specific analysis methods for insurance fraud detection.
Li et al. A supervised clustering and classification algorithm for mining data with mixed variables
CN113011973B (en) Method and equipment for financial transaction supervision model based on intelligent contract data lake
CN111612041A (en) Abnormal user identification method and device, storage medium and electronic equipment
CN109829721B (en) Online transaction multi-subject behavior modeling method based on heterogeneous network characterization learning
Li et al. Outlier detection using structural scores in a high-dimensional space
JP2020524346A (en) Method, apparatus, computer device, program and storage medium for predicting short-term profits
US7725407B2 (en) Method of measuring a large population of web pages for compliance to content standards that require human judgement to evaluate
CN113364802A (en) Method and device for studying and judging security alarm threat
CN116366313A (en) Small sample abnormal flow detection method and system
Borg et al. Clustering residential burglaries using modus operandi and spatiotemporal information
CN116467666A (en) Graph anomaly detection method and system based on integrated learning and active learning
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
Gautam et al. Adaptive discretization using golden section to aid outlier detection for software development effort estimation
CN111612531B (en) Click fraud detection method and system
Sukthanker et al. On the importance of architectures and hyperparameters for fairness in face recognition
CN106778252A (en) Intrusion detection method based on rough set theory Yu WAODE algorithms
Nawaiseh et al. Financial Statement Audit using Support Vector Machines, Artificial Neural Networks and K-Nearest Neighbor: An Empirical Study of UK and Ireland
Feng et al. EagleMine: Vision-guided Micro-clusters recognition and collective anomaly detection
Guo et al. Detecting spammers in E-commerce website via spectrum features of user relation graph
Wu et al. Medical insurance fraud recognition based on improved outlier detection algorithm
Knyazeva et al. A graph-based data mining approach to preventing financial fraud: a case study
Madyembwa et al. An Automated Data Pre-processing Technique for Machine Learning in Critical Systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18915300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18915300

Country of ref document: EP

Kind code of ref document: A1