CN112329028A

CN112329028A - Abnormal data identification method, system, device and medium based on block chain

Info

Publication number: CN112329028A
Application number: CN202011049107.7A
Authority: CN
Inventors: 朱佳; 陈善轩; 马晓东; 林志豪
Original assignee: South China Normal University
Current assignee: South China Normal University
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2021-02-05
Anticipated expiration: 2040-09-29
Also published as: CN112329028B

Abstract

The invention provides a method, a system, a device and a storage medium for identifying abnormal data based on a block chain, wherein the method comprises the following steps: acquiring transaction data, and packaging the transaction data to obtain a block to be audited; uploading the block to be audited to a block chain, and performing transaction verification on the block to be audited; the transaction verification is to determine to receive transaction data of the block to be checked according to the voting result of the machine learning model; broadcasting the transaction verification result on the block chain, packaging transaction data which is subjected to transaction verification to obtain a second block, and accessing the second block to the block chain; according to the method, two classification votes are carried out based on the gradient direction of the transaction data to eliminate the influence of malicious data or abnormal data which can be used by malicious parties, so that the transaction data stored on a block chain is more real and reliable, and the reliability is higher. The invention can be widely applied to the technical field of block chains.

Description

Abnormal data identification method, system, device and medium based on block chain

Technical Field

The invention belongs to the technical field of block chains, and particularly relates to a block chain-based abnormal data identification method, system, device and storage medium.

Background

The block chain network is greatly popularized due to the properties of high transparency, decentralization, distrust, no tampering, anonymity and the like, embodies the idea of distributed autonomy, and gradually receives wide attention of financial institutions with innovation consciousness. Blockchain networks are used as a backbone of a common distributed ledger system to handle asset transactions in the form of digital tokens between point-to-point users, particularly those networks that employ open access policies, which are distinguished by their intrinsic characteristics of non-mediation, public accessibility to network functions and tamper resistance.

Although the block chain has the characteristic that data cannot be tampered with, the data cannot be guaranteed to be authentic and reliable. If the blockchain is considered to be a database, then each event and data needs to be accurately recorded in order for the data stored on the blockchain to be authentic. In addition, in the process of actual production application, there is a situation that a malicious party can use malicious data or abnormal data to participate in the creation process of the block, or data falsification is performed, thereby causing loss.

Disclosure of Invention

In view of the above, to at least partially solve one of the above technical problems, an embodiment of the present invention provides a control method with dual time scales to ensure that transaction data recorded in a block chain are both true and reliable; and simultaneously provides a system, a device and a storage medium which can correspondingly realize the abnormal data identification method based on the block chain.

In a first aspect, the present invention provides a method for identifying abnormal data based on a block chain, which includes the following steps:

acquiring transaction data, and packaging the transaction data to obtain a block to be audited;

uploading the block to be audited to a block chain, and performing transaction verification on the block to be audited; the transaction verification is to determine to receive transaction data of the block to be checked according to the voting result of the machine learning model;

and broadcasting the transaction verification result on the block chain, packaging the transaction data which completes the transaction verification to obtain a second block, and accessing the second block to the block chain.

In some embodiments of the present invention, the step of uploading the to-be-audited block to the block chain and performing transaction verification on the to-be-audited block specifically includes: the transaction data in the block to be audited is sent to the nodes in the block chain; dividing the transaction data, and determining the gradient of the divided transaction data; and carrying out gradient fault tolerance according to random gradient descent with consistent direction in the gradient, and voting nodes based on the result of the gradient fault tolerance.

In some embodiments of the present invention, the step of uploading the block to be reviewed to the block chain and performing transaction verification on the block to be reviewed further includes: and when the node voting result is not less than the preset threshold value, updating the gradient of the node and updating the machine learning model according to the gradient of the transaction data.

In some embodiments of the present invention, the step of dividing the transaction data and determining a gradient of the divided transaction data specifically includes: dividing according to transaction data to obtain a plurality of training data sets, and training according to the training data sets to obtain a gradient descent model; and obtaining the gradient of the transaction data according to the gradient descent model.

In some embodiments of the present invention, the gradient fault tolerance is performed according to a random gradient descent in which the direction of the gradient is consistent, and the node voting is performed based on a result of the gradient fault tolerance, which specifically includes: acquiring gradients of nodes and gradients of adjacent nodes of a plurality of nodes to construct a gradient set; obtaining the scores of the nodes in the gradient set according to the gradient set, and determining the average gradient according to the scores; and training a two-classification model according to the average gradient, and obtaining a voting result of the node according to the two-classification model.

In some embodiments of the present invention, the step of training a binary model according to the average gradient and obtaining a voting result of the node according to the binary model specifically includes: and screening to obtain malicious data according to the average gradient, determining the node of the malicious data as a Byzantine node, and removing the Byzantine node.

In some embodiments of the present invention, the step of obtaining a plurality of training data sets according to the transaction data partitioning further comprises: and acquiring historical data, and dividing the historical data and the transaction data to obtain a plurality of training data sets.

In a second aspect, the technical solution of the present invention further provides a system for identifying abnormal data based on a block chain, including a data processing unit and the block chain; wherein:

the data processing unit is used for acquiring transaction data and packaging the transaction data to obtain a block to be audited; uploading the block to be checked to a block chain;

the block chain is used for receiving the block to be audited and carrying out transaction verification on the block to be audited; the transaction verification is to determine to receive transaction data of the block to be checked according to the voting result of the machine learning model; and broadcasting the transaction verification result on the block chain, packaging the transaction data which completes the transaction verification to obtain a second block, and completing the access.

In a third aspect, a technical solution of the present invention further provides an apparatus for identifying abnormal data based on a block chain, including:

at least one processor;

at least one memory for storing at least one program;

when the at least one program is executed by the at least one processor, the at least one processor implements the block chain based abnormal data identifying method in the first aspect.

In a fourth aspect, the present invention also provides a storage medium in which a processor-executable program is stored, the processor-executable program being configured to implement the method as in the first aspect when executed by a processor.

Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:

the abnormal data identification method based on the block chain provided by the invention selects the consensus result of the machine learning model based on the mode of model voting to judge whether the transaction is received; and two classification votes are carried out based on the gradient direction of the transaction data to eliminate the influence of malicious data or abnormal data which can be used by malicious parties, so that the transaction data stored on the block chain is more real and reliable, and the credibility is higher.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flowchart illustrating the steps of a method for identifying abnormal data based on a blockchain according to an embodiment of the present invention;

FIG. 2 illustrates a preparation phase for transaction verification in accordance with an embodiment of the present invention;

FIG. 3 illustrates the verification phase of transaction verification in an embodiment of the present invention;

FIG. 4 illustrates a consensus phase of transaction verification in accordance with an embodiment of the present invention;

fig. 5 is a schematic diagram illustrating visualization of random gradient descent based on gradient direction coincidence in the embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

The general idea of the invention is as follows: considering a blockchain as a database, in order for the data stored on the chain to be authentic, it is necessary to accurately record each event and data. In order to identify abnormal data in a blockchain system, the invention provides an abnormal data identification method in a blockchain environment, which is used for judging whether transaction data is received or not by selecting a consensus result of a machine learning model based on a model voting mode. Considering that a malicious party can use malicious data or abnormal data to influence the training process of a machine learning model, the method adopts a model based on consistent gradient direction, and completes model training in a Byzantine network attack environment in an online training mode.

In a first aspect, as shown in fig. 1, the present embodiment provides a method for identifying abnormal data based on a block chain, which mainly includes steps S01-S03:

and S01, acquiring the transaction data, and packaging the transaction data to obtain the block to be audited. Specifically, the node packages the generated transaction record or transaction data to form a block. More specifically, transaction data and related information cached locally by a node are stored in a block body of a new block, a Merkle tree of the transaction data stored in the block is generated in the block body, and the value of a root of the Merkle tree is stored in a block header; then, in the block head, acquiring a parent hash value, namely the hash value and a random number in the block head of the last block in the block chain; generating a hash value through an SHA256 algorithm, filling the hash value into a block header of a current block, and simultaneously generating a timestamp field; and the block also comprises a difficulty value field which can be adjusted according to the average generation time of the block in a previous period of time so as to deal with the overall calculation total quantity which is continuously changed in the whole network, and if the calculation total quantity is changed, the system can adjust the calculated difficulty value so that the time for expecting to finish the next block is still in a certain time. After the node finishes packaging the transaction data, the node submits the block to a block chain to wait for subsequent auditing, and the block is recorded as an to-be-audited block.

S02, uploading the block to be audited to a block chain, and performing transaction verification on the block to be audited; the transaction verification is to determine to receive transaction data of the block to be checked according to the voting result of the machine learning model; specifically, other nodes in the block chain receive the block to be checked uploaded by the node, and the other nodes verify the transaction data through the trained machine learning model. And other nodes verify the transaction data in the block to be checked one by one through the machine learning model, namely, the transaction record is voted, and the output result of the model of each node represents the voting result. In this embodiment, the machine learning model is a binary classification model, and in some optional embodiments, in the step of performing transaction verification on the block to be checked, when the node voting result is not less than the preset threshold, the gradient of the node is updated according to the gradient of the transaction data, and the machine learning model is updated; i.e. 0 for objection and 1 for approval.

In the implementation process of the method, the machine learning model can be trained in a mode of specifying a data set to carry out transaction verification, or the online model training mode can be carried out in steps S021 to S023 to carry out transaction verification, and for data in the same field, the data obeys independent and same distribution. If malicious data uploaded by a malicious party exists, the distribution of original data is influenced certainly. And the data uploaded by the Byzantine node is malicious data, and the attack of the Byzantine node can occur at any stage. The Byzantine node can upload malicious gradients to cause attacks on the system in a preparation stage, or avoid a defense mechanism to directly modify a final result in an integration stage, or a voting stage supports transmission of malicious voting results to influence consensus. Therefore, the steps S021 to S023 are mainly responsible for identifying malicious data or other operations uploaded by the malicious party, and the malicious data uploaded by the malicious party is prevented from affecting convergence of the model.

And S021, transmitting the transaction data in the block to be audited to the node in the block chain. As shown in fig. 2, in the preparation phase of the algorithm model, the nodes pack the unverified transaction data, send the packed transaction data to each node for receiving and checking, and each node analyzes the packed transaction data to obtain the transaction data. In order to make each node verify the block together, each node extracts a part of data for verification by adopting an equidistant sampling method.

S022, dividing the transaction data and determining the gradient of the divided transaction data. As shown in fig. 3, this step corresponds to a verification stage, in which each node extracts a part of data for verification by using an equidistant sampling method in order to verify the block by all nodes together. After each node receives the to-be-verified block, the gradient generated by the to-be-verified data is calculated, and as a malicious party can use the training of a malicious data influence model or hope to store abnormal data into a block chain, the embodiment provides an algorithm for filtering the abnormal gradient based on the random gradient descent with consistent gradient direction. In the verification stage, after each node receives transaction data to be verified, a specific algorithm for calculating a gradient generated by the data to be verified can be further subdivided into steps S0221-S0223:

s0221, obtaining a plurality of training data sets according to the division of transaction data, and obtaining a gradient descent model according to the training data sets. In this embodiment, the data set of the divided transaction data received by each node is P₁，P₂，...，P_NThe minimum batch size is

The training data set obtained by batch division is

The training round is E, the learning rate of the model is eta, and the latest gradient descent model W_t。

S0222, obtaining the gradient of the transaction data according to the gradient descent model. Based on batch data

With the current model W_tAnd circularly training a gradient descent model, and calculating:

calculating to obtain a gradient lg in which

The derivative function and the parameter η are shown to control the rate of gradient descent. Finally, broadcast gradient values lg on the block chain, and wait for gradient values lg of other nodes₁，lg₂……lg_i，lg_i+1……lg_N。

S023, carrying out gradient fault tolerance according to random gradient descent with consistent direction in the gradient, and voting nodes based on the result of the gradient fault tolerance. As shown in fig. 4, the last stage is a consensus stage, and the gradient fault tolerance is a voting process for determining whether the obtained gradient is abnormal transaction data, the model of each node constitutes a voting committee, the model feedback result is voted, if the number of votes exceeds 1/2, consensus is achieved, and the latest model is recorded and the current latest state is changed. If 1/2 is not exceeded, the node is considered malicious data and is also identified as a byzantine node. More specifically, step S023 may be further subdivided into step S0231-step S0233:

s0231, obtaining the gradients of the nodes and obtaining the gradients of the adjacent nodes of the nodes to construct a gradient set. I.e. the gradient value lg₁，lg₂……lg_i，lg_i+1……lg_NThe number of nodes is N, and the number of Byzantine nodes is f. Calculate nearest neighbor n-f-2 gradients lg_iAnd calculating the nearest gradient set V of each node_i。

S0232, obtaining the scores of the nodes in the gradient set according to the gradient set, and determining the average gradient according to the scores. Calculating the score of each node:

and simultaneously obtain the minimum in the scores:

Min I＝{i|s(i)∈min_k∈Ns(k)}

finally, the set { v } is calculated_iI ∈ MinI }.

S0233, training a binary model according to the average gradient, and obtaining a voting result of the node according to the binary model.

For exampleN honest nodes and f Byzantine nodes (n is more than f) exist on the block chain, and the nodes receive the gradient lg calculated by other nodes₁，lg₂……lg_i，lg_i+1……lg_NThen, the nearest n-f-2 gradients are calculated, the score of each node i is calculated according to the nearest n-f-2 gradients, and finally the gradient average result of each node is used as a consistent consensus result. As shown in fig. 5, the dotted arrow indicates a malicious gradient calculated from the malicious data, and the solid arrow indicates a normal gradient. And the distribution of malicious data is far away from the distribution of original normal data, so that the model is attacked. Therefore, the gradient of malicious data is generally far from the gradient calculated for normal data. Then according to the algorithm the tangential arrow is calculated from the nearest n-f-2 gradients.

In some other embodiments, the step of dividing the transaction data into a plurality of training data sets further comprises: and acquiring historical data, and dividing the historical data and the transaction data to obtain a plurality of training data sets. In order to meet the feature distribution of the training data and avoid the problem of insufficient verification data, the algorithm extracts the existing data from the historical data and supplements the existing data to the verification data. After supplementing the validation data, each node calculates a gradient based on the existing validation data set.

And S03, broadcasting the transaction verification result on the block chain, packaging the transaction data which completes the transaction verification to obtain a second block, and accessing the second block to the block chain. And finally, selecting the transaction data which exceeds 1/2 votes, and repackaging the transaction data into blocks.

In a third aspect, an embodiment of the present invention further provides an apparatus for identifying abnormal data based on a blockchain, where the apparatus includes at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor implements the block chain based abnormal data identifying method as in the first aspect.

An embodiment of the present invention further provides a storage medium storing a program, where the program is executed by a processor as the method in the first aspect.

From the above specific implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages or advantages compared to the prior art:

1. the invention provides a model on-line training method based on gradient direction consistency, which filters abnormal gradients of abnormal data by utilizing the consistency of normal data distribution, thereby training a credible two-classification model.

2. The scheme of the invention has the characteristic that the data cannot be tampered, and simultaneously, the block chain is really trusted, so that the content of the data is real and reliable.

In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.

Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.

Wherein the functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. The abnormal data identification method based on the block chain is characterized by comprising the following steps:

uploading the block to be checked to a block chain, and performing transaction verification on the block to be checked; the transaction verification is that voting is carried out through a machine learning model according to the gradient direction of the transaction data, and the transaction data of the block to be audited is determined and accepted according to the voting result;

and broadcasting a transaction verification result on the block chain, packaging transaction data which is subjected to transaction verification to obtain a second block, and accessing the second block to the block chain.

2. The method for identifying abnormal data based on block chains according to claim 1, wherein the step of uploading the block to be checked to the block chain and performing transaction verification on the block to be checked specifically comprises:

sending the transaction data in the block to be audited to a node in the block chain;

dividing the transaction data, and determining the gradient of the divided transaction data;

and carrying out gradient fault tolerance according to random gradient descent with consistent direction in the gradient, and voting nodes based on the result of the gradient fault tolerance.

3. The method for identifying abnormal data based on block chain according to claim 2, wherein the step of uploading the block to be checked to the block chain and performing transaction verification on the block to be checked further comprises:

and when the voting result of the node is not less than a preset threshold value, updating the gradient of the node and updating the machine learning model according to the gradient of the transaction data.

4. The method for identifying abnormal data based on a blockchain according to claim 2, wherein the step of dividing the transaction data to determine the gradient of the divided transaction data specifically comprises:

dividing the transaction data to obtain a plurality of training data sets, and training according to the training data sets to obtain a gradient descent model;

and obtaining the gradient of the transaction data according to the gradient descent model.

5. The method according to claim 2, wherein the step of performing gradient fault tolerance according to a random gradient descent with a consistent direction in the gradient and voting nodes based on a result of the gradient fault tolerance specifically comprises:

acquiring gradients of the nodes and gradients of adjacent nodes of a plurality of the nodes to construct a gradient set;

obtaining the scores of the nodes in the gradient set according to the gradient set, and determining the average gradient according to the scores;

and training a two-classification model according to the average gradient, and obtaining the voting result of the node according to the two-classification model.

6. The method according to claim 5, wherein the step of training a binary model according to the average gradient and obtaining the voting result of the node according to the binary model specifically comprises:

and screening to obtain malicious data according to the average gradient, determining the node of the malicious data as a Byzantine node, and removing the Byzantine node.

7. The method for identifying abnormal data based on blockchain according to claim 4, wherein the step of dividing the transaction data into a plurality of training data sets further comprises:

and acquiring historical data, and dividing the historical data and the transaction data to obtain a plurality of training data sets.

8. The abnormal data identification system based on the block chain is characterized by comprising a data processing unit and the block chain; wherein:

the block chain is used for receiving the block to be audited and carrying out transaction verification on the block to be audited; the transaction verification is to determine to accept the transaction data of the block to be checked according to the voting result of the machine learning model; and broadcasting the transaction verification result on the block chain, packaging the transaction data which completes the transaction verification to obtain a second block, and completing the access.

9. An apparatus for identifying abnormal data based on block chains, comprising:

at least one processor;

at least one memory for storing at least one program;

when executed by the at least one processor, cause the at least one processor to implement the method for blockchain-based anomaly data identification according to any one of claims 1 to 7.

10. A storage medium having stored therein a program executable by a processor, characterized in that: the processor-executable program when executed by a processor is for implementing the method of blockchain based anomaly data identification according to any one of claims 1 to 7.